Clusters for Rapid Artist-Audience Matching

A portion of the disclosure of this patent document contains material which is subject to 
copyright protection. The copyright owner has no objection to the facsimile reproduction by 
anyone of the patent disclosure, as it appears in the PTO patent file or records, but otherwise 
reserves all copyrights whatsoever.



Cross Reference to Related Applications 

This application claims priority under 35 U.S.C. Section 119(e) from United States Provisional
Patent Application 60/165,794, filed November 16, 1999. The entire disclosure of the
above-enumerated United States Provisional Patent Application, including the specification,
drawings, claims, and abstract, is considered as being part of the disclosure of this application
and is hereby incorporated by reference.

Brief Summary of the Invention

The purpose of this invention is to facilitate the existence and rapid growth of a Web site (or
other form of electronic service) that will distribute entertainment works to their audience more
effectively than other techniques.

Online services based on this invention will 

• Enable artists and entertainers to more efficiently find the consumers who will most enjoy 
their works 

• Enable consumers to more efficiently find artists and entertainers they will enjoy 

• (In some embodiments) Enable consumers of similar tastes to enjoy discussions with each
other, and, in some embodiments, to interact with artists.

• (In some embodiments) Enable individuals to play an entrepreneurial role in connecting
artists to their audience, wherein they may be paid for their success.

• (In some embodiments) Enable consumers and artists to enjoy the benefits of group buying:
more sales to the artist and lower cost to the consumer.

Detailed Description of the Invention 

Clusters: The Heart of The Invention 

The centerpiece of this invention is clusters of users who have similar tastes. Users are enabled
to find clusters that match their tastes, and artists are enabled to find the clusters whose users
are likely to be interested in their work. Clusters thus serve as hubs of activity for
particular tastes; in most embodiments ratings of items of interest to those tastes can be viewed,
and various embodiments include various means for inter-user communication so that
communities of people with similar tastes are formed.

Much of this disclosure will focus on music applications of the invention. However, this is
merely for convenience, and applications to other fields, including works in the fields of writing
and movies, fall equally within the scope of the invention.



User-Created Clusters

In some embodiments, individuals are enabled to create new clusters whenever they choose.
One reason for doing so is that they believe that there is a group of people which is not
adequately served by any of the existing clusters - for instance, because the tastes of people of
this group are substantially different from the tastes represented by any of the existing clusters.
The person creating a cluster will be known in this description as the "cluster initiator".

Means are provided for cluster initiators to specify a taste of the cluster (which in some
embodiments is later combined with taste information from other users as described elsewhere in
this document). In one embodiment, the initiator does so by specifying ratings for various items which he
feels will be useful in defining the taste of the cluster. For example, he might give a recording of
Bob Dylan's a rating of 0.95 (on a scale of 0 to 1) and a recording of Handel's Water Music a 0.1.

In another embodiment, he simply inputs a list of items which he feels will be the ones
most-liked by members of the cluster. In many cases, these will represent his personal most-liked
items. In a preferred embodiment, this list is ordered according to how well each item is liked
relative to the others on the list.

Some software and network systems such as Napster and Gnutella enable file sharing, where
files stored in the user's computer are made available for other users to download to their own
computers. Usually, the songs a user has on his computer - and thus can make available for
sharing - correspond to songs the user likes. Thus, the list of files he makes available for
download can usually be presumed to represent a list of his likes, and used that way in
computations. In some cases, of course, users will make songs they don't like available to other
users, but some embodiments view this as happening infrequently enough that such problems
may be ignored.

Then, when users of the system are looking for a cluster that might be suitable for them, their
tastes will be compared to that of the new cluster as well as those of the existing clusters. The new
cluster may be the most suitable for a meaningful number of people.

In preferred embodiments, each cluster will have associated with it various facilities such as
threaded discussion groups, chat, instant messaging, etc. These facilities will keep the users
interested and motivate them to spend more time online. Spending this time online will provide
more opportunities for advertising as well as more commitment to the cluster, increasing the
probability of further online purchasing.

In some embodiments individuals are responsible for "moderating" and "administrating" these
various facilities. This group may contain the cluster initiator. In many cases one person will
play all these roles. For convenience, this person or group will be referred to here as the "cluster
administrator."

Means are provided in such embodiments such that administrators, visitors, and members can
provide reviews of items, such as CDs, which are in the subject domain of the service. In some
embodiments, only a subset of these classes of users is enabled to input reviews; for instance,
administrators and members.

When an item has more than one review, the usefulness of the system is enhanced by means of 
presenting the reviews to users in an optimal order. In the preferred embodiment, this order is 
based on the similarity between the user reading reviews and the user who wrote a particular 
review. Reviews written by authors who are closer to the reader appear earlier in the list. 

Alternatively, the order can be dependent on the similarity of the author and the tastes of the 
cluster through which the item's reviews have been accessed. 

In addition, in some embodiments means are provided for users to rate reviews. The system can
use these ratings to determine the quality of a reviewer in general and/or the degree to which
each individual user likes a particular reviewer. These factors can be used together with the
similarity data, or in some embodiments, without it, to determine an ordering of reviews. For
example, in one embodiment orderings are generated separately based on taste and quality.
Percentile rankings are then calculated for each review. The percentiles from the two lists are
averaged, and a new ordered list is created based on these average percentiles.
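
The percentile-averaging step just described can be illustrated with a minimal Python sketch. The function names, the convention that the first review in a list receives the highest percentile, and the tie-breaking rule are assumptions made for illustration only; the disclosure does not fix them.

```python
def percentile_ranks(ordered_ids):
    """Map each review id to a percentile in [0, 1]; the first (best) id gets 1.0."""
    n = len(ordered_ids)
    if n == 1:
        return {ordered_ids[0]: 1.0}
    return {rid: 1.0 - i / (n - 1) for i, rid in enumerate(ordered_ids)}


def combine_orderings(by_taste, by_quality):
    """Average the two percentile rankings and return review ids, best first."""
    taste = percentile_ranks(by_taste)
    quality = percentile_ranks(by_quality)
    averaged = {rid: (taste[rid] + quality[rid]) / 2 for rid in taste}
    # sorted() is stable, so ties keep the taste-based order (an assumed rule).
    return sorted(by_taste, key=lambda rid: -averaged[rid])


# Hypothetical review ids, ordered once by taste similarity and once by quality.
by_taste = ["r1", "r2", "r3", "r4"]
by_quality = ["r3", "r1", "r4", "r2"]
combined = combine_orderings(by_taste, by_quality)
```

Both input lists are assumed to contain the same reviews; an embodiment using weights would replace the plain average with a weighted one.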

In some embodiments a summary of each reviewer's perceived goodness (determined by means
of ratings or passive data acquisition such as measuring the average time spent reading a
reviewer's reviews or the relative number of times a reviewer's reviews are sought out) is
displayed.

Particular embodiments can use any means to combine the above factors. For instance, geometric 
or arithmetic averages can be computed, with or without weights. 

In some embodiments, only a limited number of reviews are presented to users. These may be
the ones that would have been presented first based on the ordering techniques described above.
In some embodiments where only the "best" reviews are presented, the order in which those
reviews are presented may be by date of creation or another order, such as random.



In some embodiments, reviews are automatically (or optionally) posted to Usenet or other
publicly-available services with links back to the cluster service or the individual cluster through
which the review was written. (In some embodiments, however, there is no particular association
between reviews and individual clusters; rather the association is with the item or artist being
reviewed, so that all reviews are available in all clusters.)

In some embodiments means are provided so the administrators are paid by users for access to
their clusters. In various embodiments, these payments are one-time-only, per time period (such
as monthly) or per access. Reviewers can be paid similarly for access to reviews written by that
reviewer. Credit card payment mechanisms or other techniques such as micropayments can be
used.

In addition, in some embodiments facilities are provided for purchasing items directly through
the site (or through partner sites reached by hyperlink). In some embodiments means are
provided for users to rate their satisfaction with such purchases, and in preferred embodiments
display means are provided for users to see selected summaries of these ratings, such as the
average rating for the cluster (including average ratings for reviews, product purchases, general
satisfaction, and any other metric of cluster success).

In embodiments where users are enabled to purchase items, preferred embodiments include
means for one or more of the cluster administrators to be paid a portion of the purchase price
of the items. In various embodiments, this is a fixed percentage, a percentage that varies with
volume or total revenues, or other variations.

In some embodiments advertising is used as an additional income stream. 

Express Mail EL 453 889 575 US
Attorney Docket R49-009

In some embodiments, means are provided to enable a "group purchase" to be made wherein a
number of users will purchase an item at once, thereby getting a lower price from the provider
for the item. For instance, in some embodiments, means are provided for users to indicate that
they would be willing to buy an item at a particular price. When enough users are willing to buy
at a particular discounted price that the provider is willing to sell the item at that price, the
transaction is carried through. In other embodiments, data is stored regarding the number of
people who, having purchased an item by a particular artist in the past, and/or having reviewed
or rated an artist at a particular level, were willing to buy a discounted item when such a deal was
presented to them. This enables the system to estimate how many people are likely
to buy a particular item by the same artist if offered at a discount. This in turn enables the administrator
to purchase a substantial number of copies of the item at once, at a discount, and to pass all or
part of the savings on to purchasers. In some embodiments the software is able to automatically
email all members of a cluster about such deals, or to limit the mailing to those who have bought
items from that artist or very similar artists previously.
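
As a rough illustration of the two group-purchase mechanisms above, the following Python sketch checks whether enough users have pledged at a discounted price, and estimates demand from past acceptance of similar offers. All names, the pledge representation, and the simple proportional estimate are hypothetical; the disclosure does not prescribe a particular calculation.

```python
def deal_goes_through(pledged_prices, discount_price, min_buyers):
    """True when at least min_buyers users pledged to pay discount_price or more."""
    willing = sum(1 for price in pledged_prices if price >= discount_price)
    return willing >= min_buyers


def expected_buyers(past_offers, past_acceptances, audience_size):
    """Estimate buyers for a new discount from the historical acceptance rate."""
    if past_offers == 0:
        return 0.0  # no history yet, so no estimate is made (an assumption)
    return audience_size * past_acceptances / past_offers


# Hypothetical pledges: the highest price each user said he would pay.
pledges = [12.0, 10.0, 9.5, 8.0]
```

Here min_buyers stands for the volume at which the provider is willing to sell at discount_price.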

In some embodiments, users are able to provide ratings of clusters. However, in preferred
embodiments, more weight is given to ratings from people whose ratings have higher calculated
"representativeness." (The concept of representativeness is discussed elsewhere in this
document.)

Automatically-Created Clusters 

In preferred embodiments, automatically-created clusters exist instead of or in addition to
user-created clusters. Note that some embodiments have the automatically-created cluster features
described in this section along with a limited number of the other features of the invention which
are described in this disclosure, or none of them. Automatically-created clusters have their own
value independent of the other concepts described herein.

A technique for optimizing clusters based upon the principles of Shannon entropy will be
described. Other techniques may be similarly applicable and also fall within the scope of the 
invention. 

Appendix B contains instructions for creating clusters that maximize information transfer as that
concept is described in the literature of Shannon entropy. The related Hartley information
approach also contains information transfer calculations, and falls within the scope of the
invention, but the Shannon approach is preferred.

For completeness, Appendix C gives the Python source code to a methodology that does not use
information transfer. It is based upon the standard k-means clustering concept. This example is
included to illustrate the wide range of clustering approaches that fall within the scope of the
invention; however, the preferred embodiment uses Shannon entropy's information transfer
calculations.

This disclosure uses the term "Automatically-Created" to refer not only to systems in which
clusters are created by software without manual human intervention, but also to systems in which
clusters are optimized by software.

In embodiments where clusters are actually created by the software, a preferred methodology is
for the administrator to set the average number of songs desired per cluster. As new songs are
added to the system, new clusters are automatically created such that the average number of
songs remains approximately the same; the optimization process then populates the cluster.
These clusters, in various embodiments, may start out empty before they are optimized, or may
be initially populated with new songs or randomly chosen songs.

In order for the software to have data to base its optimizations on, user taste data must be
collected. Some embodiments do this by means of allowing users to rate songs. Preferred
embodiments do this by means of passive data collection. For instance, *.mp3 searches on the
Gnutella network cause servers to respond with a list of songs the user has made available for file
sharing, which can be assumed, without too much error, to be a list of songs liked by that person.
Radio UserLand does even better, broadcasting every song played by every user, allowing us to
build a more detailed taste profile in a completely passive way. Various embodiments use
various such means for data collection.

Some embodiments only allow recommendations or cluster information to be sent to processes
that send realistic-seeming user data to the server. (For instance, most such embodiments would
consider a process that continuously reports playing the same song to be unrealistic.)

One challenge is to associate data sent by particular processes with user identifiers that users
can use to log on to Web sites. Preferred embodiments accomplish that by noting the IP address
the user is accessing the Web site from, and seeing what passive data source, such as a Gnutella
server, exists at the same IP address. In most such embodiments the user is then asked, via the
Web interface, to confirm that he is using the type of data-broadcasting process that he is
apparently using and asked whether the system has permission to link that data to his Web logon
ID (or cookie, or other persistent identifier). In some embodiments, such as those involving
passive data collection through Radio UserLand, a publicly available user ID for the data
broadcaster is available, and that same user ID can be subsequently used by the user to log on to
the Web site; the server can then easily link the data.

Distributed Processing for Automatically-Created Clusters 

Preferred embodiments provide means for the computational load of the cluster calculations to
be spread across more than one central processing unit.

In some embodiments, this is accomplished by having completely independent processes running
on the various machines which all interact with the data as stored in a database system such as
the open-source InterBase product. Each process randomly chooses a song, then finds the
optimal cluster to move it to. If, when it is ready to perform the move, a check in the database
indicates that another process has already moved it, then it cancels the move; otherwise it
updates the database. In embodiments where more than one write needs to be performed against
the database to facilitate the move, these actions are normally put into a single transaction. Using
this procedure, a large number of computers can work together to perform the optimization more
quickly. However, a portion of the work done will be wasted because the song in question was
already moved by another process. This portion will be greater as the number of processes
grows. Therefore it is preferable to have a more centrally controlled model.
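
The check-before-move procedure for independent processes can be sketched as follows. This is only an illustration under stated assumptions: an in-memory Python class (the hypothetical SongStore) stands in for the shared database, a lock plays the role of the single transaction, and a per-song version counter is one assumed way of detecting that another process has already moved the song.

```python
import threading


class SongStore:
    """In-memory stand-in for the shared database (hypothetical)."""

    def __init__(self, assignments):
        self._cluster = dict(assignments)             # song id -> cluster id
        self._version = {s: 0 for s in assignments}   # bumped on every move
        self._lock = threading.Lock()                 # plays the role of a transaction

    def read(self, song):
        """Return the song's current cluster and version."""
        with self._lock:
            return self._cluster[song], self._version[song]

    def move_if_unchanged(self, song, seen_version, new_cluster):
        """Apply the move only if no other process moved the song in the meantime."""
        with self._lock:
            if self._version[song] != seen_version:
                return False  # another worker got there first; cancel the move
            self._cluster[song] = new_cluster
            self._version[song] += 1
            return True


store = SongStore({"song-a": 1})
_, v_first = store.read("song-a")
_, v_second = store.read("song-a")    # a second worker reads the same state
first_ok = store.move_if_unchanged("song-a", v_first, 2)
second_ok = store.move_if_unchanged("song-a", v_second, 3)
final_cluster, _ = store.read("song-a")
```

The second worker's effort is discarded, which is the wasted work the paragraph above refers to.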

Embodiments with centrally controlled models need an interprocess communication (IPC)
method. Preferred embodiments use open standards such as XML-RPC and SOAP, since these
enable clients to be written independently using any of a variety of programming languages. In
some such embodiments, a server process waits for registration messages from remote client
processes. When a client initializes itself, it communicates with the IP address and port of the
server process. The client logs on with a persistent logon ID, or the server gives it a session ID
when it first makes contact. Then a portion of the workload is assigned to the client.

Various embodiments use various methodologies for portioning out parts of the work to the
various clients. In one such embodiment, the client is sent all data needed to describe all the
clusters via IPC. Then, it is assigned responsibility for a certain number of songs. It finds the best
clusters for those songs. It sends that data back to the server, which subsequently updates the
database. At various intervals, the cluster description data is sent again to the client, containing
the results of the simultaneous work done by the various other clients.

In some other embodiments, only the data for a subset of the clusters is sent to each client.
Therefore, a set of clients is responsible for any particular song. Each client determines the best
destination among the clusters it has the data for. Then the degree of goodness of the best choice
is returned to the server by each client; the server determines "the best of the best" and updates
the database accordingly.
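
A minimal sketch of this "best of the best" selection follows. The goodness function here is a toy lookup table standing in for whatever quality measure an embodiment uses (for instance, improvement in information transfer); the function names and the data are hypothetical.

```python
def best_local_destination(song, clusters, goodness):
    """Client side: best cluster among the subset this client holds data for."""
    best = max(clusters, key=lambda c: goodness(song, c))
    return best, goodness(song, best)


def best_of_the_best(client_results):
    """Server side: pick the highest-scoring (cluster, score) pair reported."""
    return max(client_results, key=lambda pair: pair[1])


# Toy goodness table; a real embodiment would compute this from cluster data.
SCORES = {("s", "c1"): 0.2, ("s", "c2"): 0.7, ("s", "c3"): 0.5, ("s", "c4"): 0.9}


def goodness(song, cluster):
    return SCORES[(song, cluster)]


# Two clients, each holding data for a different subset of the clusters.
reports = [
    best_local_destination("s", ["c1", "c2"], goodness),
    best_local_destination("s", ["c3", "c4"], goodness),
]
winner = best_of_the_best(reports)
```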

One danger that arises in distributed processing embodiments is that a malicious client will be
created that interacts with the server exactly as if it were a legitimate client. To avert this
problem, preferred embodiments keep track of the average improvement in cluster quality per
song movement. (For instance, in embodiments based on information transfer, this is based on
the improvement in information transfer that occurs due to the movement.) When a client
suggests a movement, the improvement associated with that movement is calculated by the
server. If a client's suggestions tend to involve significantly less improvement in quality than is
the norm, the system has reason to believe the client was either not written correctly or may even
be malicious (trying to move songs inappropriately for the benefit of specific individuals or
organizations).

The preferred embodiment accomplishes this by first storing the improvement per movement for
trusted clients. These may be, for instance, clients running on the same machine as the server,
under control of the system administrator. As the client sends suggested movements to the
server, the server determines whether the client is to be trusted.

In the preferred embodiment, a client's suggestions are not added to the database until the client
has had a chance to prove itself. The server waits until it receives 100 suggestions. The average
improvement is calculated. This average needs to be within a desired range relative to the trusted
average; for instance, an installation might desire that the client's average must be within 10% of
the trusted value. If it is not, that batch of 100 suggestions is thrown away. Each batch of 100
suggestions is tested separately, in case a malicious client tries to fool the server by being "nice"
for a while, followed by malicious behavior.
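
The batch test described above might be sketched as follows. The function name, the representation of each suggestion as a single improvement number already computed by the server, and the two-sided reading of "within 10%" are assumptions for illustration.

```python
def screen_batches(improvements, trusted_average, tolerance=0.10, batch_size=100):
    """Return (batch, accepted) for each complete batch of suggested-move improvements.

    A batch is accepted when its average improvement lies within `tolerance`
    (as a fraction) of the trusted clients' average. Suggestions that do not
    yet fill a whole batch are left pending and are not returned.
    """
    results = []
    for start in range(0, len(improvements) - batch_size + 1, batch_size):
        batch = improvements[start:start + batch_size]
        average = sum(batch) / batch_size
        accepted = abs(average - trusted_average) <= tolerance * trusted_average
        results.append((batch, accepted))
    return results


# A client that behaves well for 100 suggestions, then badly, then sends 50 more.
history = [1.0] * 100 + [0.2] * 100 + [1.0] * 50
verdicts = [accepted for _, accepted in screen_batches(history, trusted_average=1.0)]
```

Because each batch is judged on its own, the good first batch does not shield the poor second batch.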

Other embodiments use other techniques for screening out malicious clients. In one such
technique, the value of the last 100 suggestions is averaged for each client, and the clients are
subsequently ranked from least valuable to most valuable. These rankings are updated whenever
new suggestions come in from a client. The last 100 suggestions from the lowest-ranking 5% (or
some other number) are always ignored. Still other embodiments calculate a Bayesian estimator
of the value of the next suggestion. The suggestion is counted if and only if the Bayesian
estimator is within a specified range compared to the trusted client, for instance, within 1%.
Other techniques are used in still further embodiments.

One particularly simple approach, used in some embodiments, is for the server to simply check
that each suggested movement increases overall information transfer.

In some embodiments where there are so many clients that the central server does not have the
processing power to check on all the clients, some clients are assigned the task of checking
that other clients are not malicious. In most such embodiments, the central server assigns these
checking tasks in such a way that the assigned checking client is unknown to the possibly
malicious client, so there is no way for the clients to collude to trick the server.

Human Input with Automatically-Created Clusters

Preferred embodiments do not rely only on software optimization of clusters. They allow users to 
suggest changes. These are only made if they result in an improvement to the clustering. 

For example, in one such embodiment, a Web page is made available where all the songs in a
cluster are listed with checkboxes beside them. (If there are too many songs in a cluster to fit on 
one page, multiple pages are used. Most search engines such as Google provide fine examples of 
how to manage a list output when the list takes more than one page.) 

There is also an entry area where the user can enter an identifier for the destination cluster. In
various embodiments, an identifying number or name may be entered, or there may be a
pull-down list if the number of clusters is small, or a more elaborate search mechanism is used.

The user checks some of the checkboxes, specifies the destination, and indicates he is ready to 
continue (for instance, there may be a Continue button). 

The system then determines whether the suggested movement would improve the overall 
clustering. For instance, in embodiments which use information transfer to measure cluster 
quality, the information transfer that would result if the move were completed is calculated. If it 
is an improvement, the transfer goes through. Otherwise, it does not, and the user is informed 
that the transfer didn't go through. Preferred embodiments then let the user make adjustments to 
his suggestion; for instance, the song listing may be presented again with the checkboxes in the 
state the user left them in. He can then make alterations and click Continue again. 

In preferred embodiments, the user can create a new cluster as the proposed destination. It would 
then be initially populated with the songs he selects, if doing so increases the quality of the 
clustering. Many such embodiments provide a user interface whereby the user can enter songs to
populate the new cluster with, without regard to their original clusters. In most such
embodiments the administrator can set an upper limit on the number of clusters that may be
created in this way.

The embodiments discussed here thus join human input with computer optimization in such a
way that the human input is smoothly integrated into the process. All accepted human input 
furthers the aim of improving the clustering. 

Names for Automatically-Created Clusters

Preferred embodiments provide input means for users to name automatically-created clusters. 

In one such embodiment, a page is presented in which there are 20 text input areas, each
providing enough space to enter a name. When a name is entered into one of the text areas (and
Submit is clicked), the name may not be removed except by an administrator for a period of one
week. Next to each name is a set of radio buttons labeled "no opinion, poor, fair, good, excellent".
Users can thus rate any or all of the names. User identification is carried out by means of a logon
requirement, cookies, or other means; only one vote per user per name is allowed.

An overall rating for each name is determined by means of averaging the ratings, ignoring "no
opinion" ratings.

After a name has been displayed for one week, if it is not among the top 50%, it is deleted, and 
any user can enter a new name. 

Only one name per user at a time is accepted in the list. 

At any point in time, the highest-rated name is used as the name of the cluster, displayed
wherever it is convenient to display such a name. In many embodiments a cluster number is also
displayed, which is constant over the lifetime of the cluster, and therefore may be useful when a
reliable way of identifying a cluster is needed.
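
The naming procedure of this section can be sketched in Python as follows. The numeric values assigned to the labels, the treatment of a name with no counted votes as rated 0.0, and rounding the "top 50%" upward are assumptions; the candidate names and votes are hypothetical.

```python
SCORES = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}  # assumed numeric mapping


def average_rating(votes):
    """Average the votes for one name, ignoring "no opinion"."""
    counted = [SCORES[v] for v in votes if v != "no opinion"]
    return sum(counted) / len(counted) if counted else 0.0


def rank_names(votes_by_name):
    """Return names ordered from highest to lowest average rating."""
    return sorted(votes_by_name,
                  key=lambda name: average_rating(votes_by_name[name]),
                  reverse=True)


def survivors_after_week(votes_by_name):
    """Names in the top 50% of the ranking; the others would be deleted."""
    ranked = rank_names(votes_by_name)
    return ranked[:(len(ranked) + 1) // 2]


votes = {
    "Delta Blues": ["excellent", "good", "no opinion"],
    "Misc": ["poor", "fair"],
    "Slide Guitar": ["good", "good"],
    "Stuff": ["no opinion"],
}
cluster_name = rank_names(votes)[0]
kept = survivors_after_week(votes)
```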

User-Cluster-Similarity 

In most embodiments, means are provided to compute a similarity between a user and a cluster. 

In some embodiments, users provide ratings that represent their tastes. In other embodiments
purchase histories are used. In other embodiments, "passive" data collection such as tracking the
artists and recordings that are downloaded and/or listened to can be used. In general, any source
of information which captures the user's preferences in the target domain is acceptable; this
includes taking note of the music files made available for Napster, Gnutella, or other types of file
sharing.

In some embodiments, the "taste of the cluster," its "taste signature," is defined wholly by the 
administrator; in others it is defined by the members or even the visitors to the cluster, or by a 
combination thereof. The taste signature is stored in a database on the server. In some 
embodiments it takes the form of a list of artists or items considered to be "liked" by the cluster; 
in some embodiments this list is ordered with the most-liked items appearing at the top; in some 
embodiments ratings are associated with items and artists, for instance, on a scale from 
"excellent" to "poor". 

In each of these embodiments, where data from various users are combined to form the taste 
signature, appropriate means are used. For instance, where ratings are used, the ratings for 
various items and artists are averaged; in some such embodiments, a weighted average is used 
with the administrator having a greater weight than other users. In embodiments where ordered 
lists are used, means for combining include converting the lists to percentile rankings, averaging 
the percentile rankings for each album, and outputting a new ordered list in order of the averaged 
percentiles. 
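
For the ratings case, the weighted average with a heavier administrator weight might look like the following sketch. The weight of 2.0, the 0-to-1 rating scale (borrowed from the earlier Bob Dylan example), and the user and item names are illustrative assumptions only.

```python
def taste_signature(ratings_by_user, admin, admin_weight=2.0):
    """Weighted average rating per item; the administrator counts admin_weight times."""
    totals, weights = {}, {}
    for user, ratings in ratings_by_user.items():
        w = admin_weight if user == admin else 1.0
        for item, rating in ratings.items():
            totals[item] = totals.get(item, 0.0) + w * rating
            weights[item] = weights.get(item, 0.0) + w
    # Items are averaged only over the users who actually rated them.
    return {item: totals[item] / weights[item] for item in totals}


ratings = {
    "admin": {"A": 1.0, "B": 0.0},
    "u1": {"A": 0.4},
    "u2": {"A": 0.7, "B": 0.6},
}
signature = taste_signature(ratings, admin="admin")
```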

When a user wants to make use of the system, he usually does so by finding clusters of taste
similar to his own and, in preferred embodiments, with other positive characteristics.

In preferred embodiments, means are provided to display a list of clusters together with
descriptions of each cluster supplied by the administrator. These descriptions, in various
embodiments, take the form of text descriptors such as "Jazz, with a focus on old-style
Dixieland", categories such as "Jazz", "Garage Rock", etc., or other means of communicating the
center of the cluster.

In preferred embodiments, means are provided for users to search for clusters which they believe
they will be interested in. In embodiments where categories are provided, users can pick a
category. In some embodiments where text descriptions are provided, users can search through
these descriptions using standard text-retrieval techniques in order to find clusters relevant to
their tastes.

In preferred embodiments, users can specify their personal tastes, and the system automatically 
lists clusters where the taste signature of the cluster is near to the taste signature of the user. 

In preferred embodiments, when lists of clusters are presented based on any of the search 
techniques mentioned above, or other search techniques, the attributes mentioned above such as 
category and similarity to the user viewing the list are displayed, as may other cluster attributes 
which apply to the given cluster. 

In some embodiments, "passive" data collection methods are used in matching clusters to users. 
These methods involve no special input of data indicating tastes. 

In some such embodiments in the field of music, customizable Internet "radio" stations are 
associated with some or all clusters. Such stations play a mix of recordings using TCP/IP, 
multicasting, and/or other protocols to send streaming audio data (with additional video in some 
cases) to the user's computer where it is converted into sound. The recordings which are of the 
most interest to a cluster will tend to be played most often; the recordings of least interest to the
cluster, while still being "liked" by the cluster, will be played least often. Play rates can be used 
to tabulate ranks for items. In some embodiments, rank data is compiled for artists instead of, or 
in addition to, items. In most such embodiments, the administrator determines the play lists and 
relative frequency of playing various artists and cuts. 

This rank data is then used for searching, whether acquired through manual user action or
passively. In some embodiments, users input their favorite artists (or recordings, depending on
the embodiment) in order of preference. In one embodiment, rank correlation is then used to find
the closest matches, by computing the rank correlation for each cluster in turn and then picking
the ones with the greatest level of correlation. In preferred embodiments, further processing is
done to calculate p-values relative to the rank correlations, and the p-values closest to 0 indicate
the closest match. (This is preferable because p-values seamlessly incorporate the number of
artists or items in common on the lists being matched, as well as the degree of similar ordering.)
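
The disclosure does not name a particular rank correlation. As one possibility, the following sketch computes Spearman's rank correlation over the items that a user's ordered list and a cluster's ordered list have in common. It assumes no ties and no duplicate entries, and it stops short of the p-value calculation of the preferred embodiments, returning the number of shared items alongside the correlation so that a significance test could be applied afterwards.

```python
def spearman_on_common(user_list, cluster_list):
    """Spearman rank correlation over the items that appear in both ordered lists.

    Returns (rho, n_common); rho is None when fewer than two items are shared.
    """
    cluster_set = set(cluster_list)
    common = [item for item in user_list if item in cluster_set]
    n = len(common)
    if n < 2:
        return None, n
    common_set = set(common)
    # Re-rank the shared items within each list (0 = most liked).
    user_rank = {item: r for r, item in enumerate(common)}
    cluster_order = [item for item in cluster_list if item in common_set]
    cluster_rank = {item: r for r, item in enumerate(cluster_order)}
    d_squared = sum((user_rank[i] - cluster_rank[i]) ** 2 for i in common)
    rho = 1 - 6 * d_squared / (n * (n * n - 1))
    return rho, n


# Hypothetical ordered lists of artists, most liked first.
user = ["a", "b", "c", "d"]
same_order = spearman_on_common(user, ["x", "a", "b", "c", "d"])
reverse_order = spearman_on_common(user, ["d", "c", "b", "a"])
```

Matching a user to clusters would then amount to calling this function for each cluster in turn and keeping the clusters with the greatest correlation.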

In other embodiments, other means are used to measure taste similarities based on this data. In 
some embodiments, for instance, rank data is converted into "ratings" data by dividing the 
rankings into groups and considering the items (or artists) in the group of highest ranks to have 
the highest rating, the items (or artists) in the 2nd-highest group of ranks to have the second- 
highest rating, etc. (There are an equal number of groups of ranks to the number of ratings; for 
instance, if there is a 5-point rating scale, one embodiment would assign the top 20% of items [or 
artists] to the highest rating, the next 20% to the next highest rating, etc.) Then rating-based 
techniques such as those described in US Patent 5,884,282 for measuring similarity are used. 
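
As a sketch of the rank-to-rating conversion just described, assuming equal-sized groups and a list already ordered from most to least liked (the name `ranks_to_ratings` is hypothetical):

```python
def ranks_to_ratings(ranked_items, scale=5):
    """Map a best-first ranked list to ratings from `scale` down to 1."""
    n = len(ranked_items)
    ratings = {}
    for pos, item in enumerate(ranked_items):
        group = pos * scale // n      # 0 for the top group of ranks
        ratings[item] = scale - group
    return ratings
```

With ten items and a 5-point scale, the first two items receive a 5, the next two a 4, and so on down to a 1 for the last two.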

In some embodiments, other types of data than rank or ratings data are used. For instance, in 
some embodiments, simple counts of the number of items (or artists) in common on the two lists 
are used; a higher number means more similarity of taste. It should not be construed that this 
invention depends on the use of any particular type of this "taste signature" data. 

In embodiments where we have only "presence/absence" data available, such as a Napster file 
list in which a particular song is either present or absent, a variety of calculations can be used. 
While the invention should not be construed to be limited to any particular calculations, several 
will be listed for purposes of example: the Ochiai, Dice, and Jaccard indices. In calculating these 
indices, some embodiments consider the entire list of songs to be the combination of all songs 
contained in either the cluster in question or the user's liked list. The presence and absence are 
determined corresponding to this expanded list. Some other embodiments consider the master list 
to be the list of songs liked by the user; other songs are ignored. Thus in such embodiments the 
user only has "presence" indicated; whereas the cluster will usually have a mix of presence and 
absence. Other embodiments do the reverse, taking the cluster's list to be the master list. Some 
embodiments further calculate statistical significances with respect to such indices, by making 
use of the statistical distribution of the used index (Snijders 1990). In all these cases a number is 
calculated which corresponds to the degree to which the user's list of songs matches the clusters' 
list of songs. 
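
The three indices named above can be sketched as follows for the embodiment in which the master list is the combination of the user's songs and the cluster's songs. The function name `presence_indices` and the choice to return 0 for an empty list are assumptions made for illustration.

```python
from math import sqrt


def presence_indices(user_songs, cluster_songs):
    """Jaccard, Dice and Ochiai indices for two presence/absence lists."""
    u, c = set(user_songs), set(cluster_songs)
    if not u or not c:
        # Assumption: an empty list is treated as having no similarity.
        return {"jaccard": 0.0, "dice": 0.0, "ochiai": 0.0}
    both = len(u & c)
    return {
        "jaccard": both / len(u | c),
        "dice": 2 * both / (len(u) + len(c)),
        "ochiai": both / sqrt(len(u) * len(c)),
    }
```

In each case a higher value means a closer match between the user's list and the cluster's list.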

In some embodiments, passive data collection is done on the user side, in addition to, or instead 
of, doing so on the cluster side. In some embodiments, for example, use is made of the fact that 
users often have MP3, CD, streaming audio, or other types of music players on their machines. 
Such players can be adapted by their programmers (and, in the case of open-source players, by 
any competent programmer) to store playback-based taste-signature data similar to that described 
for customizable Internet radio stations. In some embodiments this data is stored on the user's 
computer; in others it is stored on a central server. As noted earlier, lists of files made available 
for Napster, Gnutella, or other file sharing may be used. As before, rank correlation or other 
means, depending upon the embodiment, are used to determine the most appropriate clusters. 

In some further embodiments, recommendations generated by clusters are integrated directly into 
the user interfaces of the users' players. For example, in some embodiments the software 
residing on the server is sent the playback data for a user, finds the most appropriate cluster, and 
sends the player software a list of the most highly-rated recordings. These recommendations are 
made available to the user (in one embodiment, by means of a pull-down menu; in another, by 
means of a scrolling list; in other embodiments, by other means) and the user can then choose the 
one he wants to hear. In various embodiments additional information may be included in the 
display, such as the name of the artist, the length of the song, etc.; in some embodiments, it is 
possible to click on a feature and be transported to a World Wide Web page with information on 
the recording. 

In some embodiments, the user's player is sent the taste signature data associated with the 
various clusters and makes the decision about which cluster is most appropriate. This lessens 
privacy concerns because no complete record of a given individual's tastes needs to exist on the 
server to facilitate the recommendation process. 

In some embodiments, the methods described here and other methods are used to measure 
similarities between individual users. For instance, in some embodiments these similarities are 
used to determine the order in which a user views reviews written by other users; the ones by 
users with the most similarity to the user reading the reviews are displayed first. 

Cluster Membership 

In preferred embodiments, users can become members of clusters. In some embodiments, 
members of clusters are given special access to certain facilities like chat rooms and discussion 
boards. In some embodiments they are given special pricing considerations when making 
purchases. 

In typical embodiments, cluster members are known to the system by a logon ID and password. 
Members can join a cluster they are visiting by indicating that they wish to join; in some 
embodiments this is accomplished by checking an HTML checkbox. 

Goodness List 

In preferred embodiments, a goodness list is associated with each cluster. This somewhat 
corresponds to the top-40 song lists from decades ago. 

Because a typical system might have hundreds or even thousands of clusters, the goodness list 
associated with each cluster will be highly targeted to particular tastes. 

In some embodiments, manually entered ratings, supplied by the users, are averaged or otherwise 
combined to form the goodness score, and songs are listed in order of score. 

In preferred embodiments, the necessary data is collected passively. In preferred embodiments, 
this data includes the number of times each user plays each song. Players or file sharing 
processes communicate their passively collected data to the server by using such common 
interfaces as SOAP, XML-RPC, or others. 

At the time this disclosure is being written, Radio UserLand broadcasts this data for its users by 
means of XML, and any process that wants access to it can get it by reading an XML file at a 
particular IP address. Radio UserLand broadcasts the time each song is played by each user; this 
data can be compiled to obtain a frequency of playing for each song. 

Preferred embodiments use such data as follows. For each user: 

• The number of times he has played each song in the last week (or during some other chosen 
time period) is computed. (Over the entire population, this results in one count per user per 
song.) Songs he has not played during that period are ignored in all following steps. 

• The user's played songs are ranked with respect to one another according to the number of 
plays. 

• A number between 0 and 1 is assigned depending on rank, in increments of 1/N, where N is 
the number of songs played at least once by the user. The most frequently played song has a 
rank of 1, the least, a rank of 1/N. We will call these "unit ranks". 

Then, for each song: 

• The geometric mean of the unit ranks is computed. This is done by multiplying the unit 
ranks, and computing the Mth root of the product, where M is the number of unit ranks that 
were multiplied. This geometric mean is considered to be the "goodness" of the song. 
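
The steps above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: each user's data is a mapping from song to play count, ties in play count are broken arbitrarily by sort order, and the geometric mean is computed through logarithms. The names `unit_ranks` and `goodness` are hypothetical.

```python
from math import exp, log


def unit_ranks(play_counts):
    """Map one user's {song: plays} to {song: unit rank in (0, 1]}."""
    played = {s: c for s, c in play_counts.items() if c > 0}
    n = len(played)
    ordered = sorted(played, key=lambda s: played[s])  # least played first
    # Assumption: tied play counts keep their sort order (arbitrary).
    return {song: (i + 1) / n for i, song in enumerate(ordered)}


def goodness(all_users_counts):
    """Geometric mean of each song's unit ranks over the users who played it."""
    logs = {}
    for counts in all_users_counts:
        for song, rank in unit_ranks(counts).items():
            logs.setdefault(song, []).append(log(rank))
    # exp(mean of logs) equals the Mth root of the product of M unit ranks.
    return {song: exp(sum(v) / len(v)) for song, v in logs.items()}
```

Note that a user who plays one song 100 times contributes the same unit rank of 1 for it as a user who plays it only slightly more often than his other songs.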

The number computed for each song as described above has two main advantages over other 
known approaches: 

• Because of the ranking process, a particular user who tries to maliciously skew the process 
by playing a particular song an overwhelmingly huge number of times does not end up 
having any greater effect than another user who played the song only a little more frequently 
than other songs. 

• By using the geometric mean to compute the goodness, the songs with the highest goodness 
values are the songs that most consistently achieve high play rates among users who have 
heard them. This consistency is important, because our aim is to create a goodness list that is 
very reliable. Ideally, a top-ranked song in the goodness list of a cluster will be very likely to 
appeal to everyone who feels an association to that cluster. Geometric means accomplish that 
aim. 

Some embodiments take the geometric mean methodology a further step, and treat the ranks as 
p-values. These p-values are with respect to the null hypothesis that the song has no particular 
tendency to be ranked above average compared to other songs. Then, -2 times the natural 
logarithm of the product of these p-values has an approximately chi-square distribution with 2M 
degrees of freedom. So, instead of taking the Mth root, we use the chi-square distribution to 
calculate a resultant "combined" confidence level, represented by another p-value. This resultant 
p-value can then be used as the goodness. Under this goodness measurement, the songs with the 
highest goodness would be even more reliably liked by a user with an affinity for the cluster than 
using the geometric mean method. 
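
One way to sketch the combination of p-values is Fisher's method, in which -2 times the sum of the natural logarithms of M independent p-values is referred to a chi-square distribution with 2M degrees of freedom. The sketch below assumes the p-values are oriented so that a small value indicates a strongly liked song (for instance, p = 1 - unit rank + 1/N, which is an assumption and not stated in this disclosure). The name `combined_p_value` is hypothetical.

```python
from math import exp, log


def combined_p_value(p_values):
    """Combine M p-values with Fisher's method; return the combined p-value."""
    m = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # chi-square statistic / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # exp(-x/2) * sum_{k=0}^{M-1} (x/2)^k / k!
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half_stat / k
        total += term
    return exp(-half_stat) * total
```

Under this sketch a smaller combined p-value corresponds to a better song, and a song with many listeners can reach a smaller combined p-value than a song with few listeners even when every individual p-value is the same.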

The problem with the chi-square method is that songs with a lot of people hearing them tend to 
generate better confidence levels, because there is more data to generate confidence from. This 
prejudices that goodness measure against new songs that few people have heard, even if they 
play the song extremely frequently. 

However, in some embodiments, it is still considered worthwhile to display the chi-square-based 
goodness, to be as confident as possible that the top-ranked songs will be liked by nearly anyone 
who hears them, even though some even better newer songs will not get the attention they 
deserve. 

In some embodiments, more than one goodness list is displayed, usually along with text 
describing the advantages and disadvantages of each one. For instance, one such embodiment 
displays the chi-square-based list with the heading "Old Reliable - You'll Be Sure To Like The 
Top Listed Ones Here!" and displays the geometric-mean-based one with the heading "Hottest 
of the Hot - The Top Ones Tend To Be Newer Songs Loved By Everyone Who's Heard Them!" 

Some embodiments display other measures, some of which are more akin to traditional 
popularity measures, such as ranking the songs according to the number of people who have 
heard each one or the total number of plays it has received. Some embodiments display such 
numbers with the data restricted to users associated with the cluster; some do so over the entire 
population. Any combination of measures can be displayed. 



In general, any measure that conveys the degree to which a song is popular or liked can be used. 
These measures are often most valuable when the input data is restricted to members of the 
cluster for which they are being displayed. For instance, someone who loves serious, literary 
folk music may dislike all disco music. If for some reason he downloads a disco song and plays 
it once, he probably wouldn't play it again. But that should not cause the song to have a low 
goodness in lists that are displayed in a cluster that appeals to disco lovers. 

Note that in some embodiments, there is no time window for the data to be considered by these 
calculations; in others older data is given less weight according to a decreasing scale, such as 
using half-life calculations for the data based upon the exponential distribution. (Given a chosen 
half-life, such as 30 days, one can compute the decay for any point in time using the exponential 
distribution. For our example, 30 days would have a decay of .5; days less than 30 would have 
decay values between 1 and .5; days greater than 30 would have decay values between .5 and 0.) 
This decay is an appropriate weight for the data points. If arithmetic averaging is used, the decay 
for each ranking is multiplied by the unit ranking. If geometric averaging is used, the decay is 
used as a power for the unit ranking. Other decreasing scales may also be used. Different 
lists may have different scales. For instance, an "Old Reliable" list may have a window of one 
year, or include all relevant data ever collected, and a "Hottest of the Hot" list for the same 
cluster may have a window of one week. 
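
The half-life weighting can be sketched as follows. The decay follows directly from the chosen half-life; the two averaging functions reflect one reading of the text, in which the decay multiplies each unit rank for arithmetic averaging and serves as its exponent for geometric averaging. Dividing by the sum of the weights is an added assumption, as are the function names.

```python
from math import exp, log


def decay_weight(age_days, half_life_days=30.0):
    """Weight of a data point: 1 at age 0, 0.5 at one half-life, toward 0."""
    return 0.5 ** (age_days / half_life_days)


def weighted_arithmetic(samples, half_life_days=30.0):
    """samples: (unit_rank, age_days) pairs. Decay-weighted arithmetic mean."""
    weights = [decay_weight(age, half_life_days) for _, age in samples]
    total = sum(w * rank for w, (rank, _) in zip(weights, samples))
    return total / sum(weights)  # assumption: normalize by total weight


def weighted_geometric(samples, half_life_days=30.0):
    """Decay-weighted geometric mean: each unit rank raised to its decay."""
    weights = [decay_weight(age, half_life_days) for _, age in samples]
    log_sum = sum(w * log(rank) for w, (rank, _) in zip(weights, samples))
    return exp(log_sum / sum(weights))  # assumption: normalize by total weight
```

With a 30-day half-life, a play recorded 60 days ago carries a quarter of the weight of a play recorded today.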

Radio 

In some embodiments each cluster broadcasts its songs, as many services on the Web do, using 
such formats as streaming MP3 and RealAudio. In some embodiments the 
administrator of a cluster can turn this feature on or off for a given cluster. 

All-You-Can-Eat Services 

At the time of writing of this disclosure, many people in the Internet industry believe that a time 
will come when users will be able to access any song they want at any time, and either download 
it or play it in a streaming manner. Napster enables anyone to download any of a very great 
number of songs at any time for no charge, but its legality is in question because record 
companies and artists are not being reimbursed. It is generally expected in the industry that paid 
services will shortly come into existence that give users benefits similar to those provided by 
Napster today, but legally. It is usually envisioned that a flat fee will be involved, akin to a 
monthly cable television bill. Cable TV is all-you-can-eat in the sense that for one fixed fee, the 
consumer gets to watch as much TV as he wants. The anticipated network-based music services are 
expected to also be all-you-can-eat in the sense that users can have access to as much music as 
they want for a fixed fee. 

A marketplace may evolve in which artists charge such services different amounts based on their 
popularity. A relatively unknown artist might charge less than a better-known artist. 

The service described in this disclosure can, in such a marketplace, be of use to all-you-can-eat 
services because the goodness measures can be used to determine who is good, regardless of the 
present popularity. Thus, an all-you-can-eat service can save money by marketing relatively 
unknown, but good, artists to its users; the more the users choose to download or listen to lesser- 
known artists, the more the service saves. 

Recommendations 

In some cases, users will not want to have to think about clusters. They will simply want 
recommendations of songs. 

Elsewhere in this disclosure means of measuring user-cluster-similarity are discussed. 
Recommendations are then made, in some embodiments, by finding the cluster(s) most similar to 
the user, and recommending the best songs in those clusters, according to the goodness measure 
used by the particular embodiment. 

For instance, in some such embodiments, means such as the Ochiai presence/absence index are 
used to calculate a user-cluster similarity number where a higher value means more similarity, 
and a goodness calculation within each cluster is also performed, such as using the geometric 
mean of unit ranks, where a higher value means more goodness. The two numbers are then 
multiplied; we will call the result the recommendation priority. Recommendations are 
subsequently made in descending order of the recommendation priority. 
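
The recommendation priority described above can be sketched as follows. The input shapes and the name `recommend` are hypothetical, and keeping the highest priority when a song appears in more than one cluster is an assumption made for illustration.

```python
def recommend(similarity, goodness, top_n=10):
    """Order songs by recommendation priority = similarity * goodness.

    similarity: {cluster: user-cluster similarity, higher is closer}
    goodness:   {cluster: {song: goodness within that cluster}}
    """
    best = {}
    for cluster, sim in similarity.items():
        for song, g in goodness.get(cluster, {}).items():
            priority = sim * g
            # Assumption: a song in several clusters keeps its best priority.
            if priority > best.get(song, 0.0):
                best[song] = priority
    return sorted(best, key=best.get, reverse=True)[:top_n]
```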

If it is desired to give more weight to one of these factors, it can be taken to a power. The power 
can be tuned over time. One way to do that is to try different values, assigning each value for a 
significant period of time, such as a month. The more appropriate the power, the higher the 
following number will be: The average of (the recommendation priority times some passive or 
active measure of how much the user likes the song). For instance, in embodiments involving 
Radio UserLand, for each recommended song that the user has not heard before, we multiply the 
number of times the user actually plays it in the first week after receiving the recommendation by 
its recommendation priority, and compute the average of those numbers. The higher that average 
is, the better the weight is. After trying a number of weights over a period of time, the best one is 
chosen. 
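
The tuning procedure can be sketched as follows, assuming the power is applied to the similarity factor and that, for each trial period, the system has recorded the similarity, the goodness, and the number of plays in the first week for every recommended song. All names and the data layout are hypothetical.

```python
def priority(similarity, goodness, power=1.0):
    """Recommendation priority with the similarity factor taken to a power."""
    return (similarity ** power) * goodness


def tuning_score(trials, power):
    """Average of (priority * first-week plays) over one trial period.

    trials: list of (similarity, goodness, plays_in_first_week) tuples
    gathered while `power` was in use.
    """
    scores = [priority(s, g, power) * plays for s, g, plays in trials]
    return sum(scores) / len(scores)


def best_power(trials_by_power):
    """Pick the power whose trial period produced the highest average."""
    return max(trials_by_power,
               key=lambda p: tuning_score(trials_by_power[p], p))
```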

Other ways of combining the two numbers for calculating the recommendation priority are used 
in various other embodiments, such as adding them; and in still further embodiments, other 
methods are used, such as only picking one cluster for recommendations and then ordering them 
by goodness. 

Artist Tools 

Items may be submitted by artists for examination by cluster administrators, possibly leading to 
ratings, reviews, or other consideration. In some embodiments special forms, such as Web form 
input, are provided for this purpose. 

In preferred embodiments, means are provided to give artists some control over their "persistent 
reputations" as determined by ratings and reviews. In some such embodiments artists are given 
means to specify the clusters that may request or display reviews and ratings of their works. In 
further embodiments, clusters that cannot display or request reviews for an artist cannot receive 
submissions from him. 

In order to assist artists in directing their submissions to appropriate clusters, preferred 
embodiments provide special tools. Preferred embodiments use taste-based searching. In one 
such embodiment, a form (such as a Web input form) is provided which allows an artist to list 
similar artists. The clusters with most-liked-lists with the most artists in common with the artist's 
list are the best ones to submit to. In a further embodiment, these similar artists are listed in order 
of similarity. The rankings are then matched against the cluster's rankings on their ranked most- 
liked-lists using rank correlation. In still another embodiment, artists rate other artists regarding 
their similarity, and the cluster stores ratings of artists according to their perceived goodness. The 
scale may be, for instance, a 7-point scale from "Excellent" to "Fair" in each case; although in 
one case similarity to a given artist is measured and in another case "goodness" seems to be 
measured, in fact the "goodness" measure is really similarity to the tastes of the given cluster. So 
the clusters with the best matches on these ratings are the ones to submit to in that embodiment. 
In general, the various techniques mentioned earlier for enabling users to find appropriate 
clusters may also be used for artists, including deriving lists of songs from the files made 
available by the artist for file sharing via Napster, Gnutella, or other means, and/or using 
presence/absence indices. It should not be construed that this invention is limited to any 
particular means for taste-based searching. 

In some embodiments, artists are given means to indicate that they wish to pay a particular 
individual to listen to, rate and/or write a review of their work. In some further embodiments, 
they can read the review and decide whether it is to be displayed online. In some embodiments, 
means are provided such as online credit card payment or checking account withdrawal through 
which the individual reviewer can be paid for doing the rating/review. In order to help the artist 
decide which user to ask for a rating and/or review, users (who may be administrators or other 
users) each have information available online which would help to indicate their suitability. 
First, if they are members or administrators of relevant clusters, that provides a first level of 
filtering indicating that their tastes are probably consistent with the interests of the artist. In some 
embodiments, previous reviews by the user are available in one easily-accessed list. In addition, 
in some embodiments, if the user has entered his own ratings or explicit or implicit list of most- 
liked-artists, whether ordered or unordered, the artist can use his own similar information (with 
regard to similarity of various works to the artist's own work or simply with regard to the artist's 
own personal likes) to request that the system generate a calculated indicator of appropriateness, 
similar to that used when users are searching for appropriate clusters. In some embodiments 
artists can search for appropriate users using these means without consideration of clusters. 

Features are provided for helping the artist make informed choices about which users to submit 
their items to for review. In some embodiments, artists are given input means to rate users on 
their satisfaction with the ratings and reviews they have paid for. Other artists can see summaries 
of these ratings, for instance, in one embodiment, averages of the ratings, in order to judge who 
to pay. (For instance, a reviewer may write a negative review but not make it public, and make it 
a useful critique of the work, which the artist can use in refining his work in the future; such a 
review might be negative but still valuable.) In some embodiments, users can set their own fees 
for reviewing and/or listening. 

In addition, in some embodiments, a rating reliability number is calculated for users. This allows 
artists and other users to know how reliable a particular user's ratings are, helping artists judge 
whether to submit their items for rating and review by a particular user, and helping users decide 
which other users' ratings and reviews to read. See Appendix A for more detail. 

In preferred embodiments, information is not given to the artist that will enable him to choose 
reviewers who only review highly. For instance, a preferred embodiment gives artists access 
only to each reviewer's reliability data and cluster membership. Artists will then be motivated 
to pick reliable reviewers, as compared to reviewers who often disagree with the majority, but 
they will not have a means to predict which reviewers only rate highly. Of course, in such an 
embodiment, an identifier for a reviewer that would enable the artist to associate him or her with 
particular displayed reviews would not be made available. 

In a preferred embodiment, the system keeps track of the songs a user has been paid to listen to. 
It notes that user's relative play frequency for the songs in the weeks immediately after being 
paid, and the play frequencies in the cluster(s) to which the songs belong after some time has 
passed, for instance, 6 months, and the songs have had a chance to become known. Then, the 
rank correlation is calculated between the user's play frequency and the cluster's. This 
correlation is then used as the basis for recommending people to artists to pay to listen to their 
songs. To have a high correlation, the user must a) actually listen to the songs he is paid to listen 
to, and b) judge them similarly, relative to each other, to the way the cluster membership as a 
whole ultimately judges those same songs relative to each other. This embodiment is particularly 
appropriate in conjunction with the feature that displays songs ranked according to their average 
frequency of play among those who have heard the song at all (or other similar features). It 
means that one user or a small number of users can be paid to hear a song, and if they like it, it 
will immediately be catapulted to the top of the goodness list for a cluster, encouraging still more 
people to listen to it, enabling good songs to become popular very quickly. 

In some embodiments, artists don't have a choice regarding who they pay. Instead, the artist pays 
a fee, and the system decides the best people to expose the work to and/or extract ratings from. 
This simplifies things on a number of levels - there can be a less complicated user interface, and 
the artist needs to do less work. In some embodiments, artists are presented with information to 
the effect that, for a certain fixed fee per listener, we will pay as many people as he desires 
(within limits of availability) to listen. Other embodiments enable listeners to set their fees, and 
the system chooses based upon the fees and calculated reliability associated with each one. 

Various forms of payment can be used in various embodiments. For instance, in some 
embodiments, money is not transferred, but instead an artist promises to make a certain number 
(or all) of his future recordings available to the listener for low or no cost. 

In some embodiments, an "appropriate submission rating" is associated with each artist. Users 
rate artists with respect to the appropriateness of submitting the given item to the given user for 
ratings and review, which depends upon how well the item corresponds to the tastes of the user 
who is being asked to rate or review. The idea is to create a persistent record of the appropriateness of 
an artist's submissions in order to discourage him from "spamming" the clusters by submitting 
too broadly. Users can see a summary of appropriate submission ratings for the artist in question; 
in some embodiments this is a simple average; in others, it is a Bayesian estimator of the 
expected rating; in other embodiments, other summarization methods are used. Similarly, artists 
can see summaries of the appropriate submission ratings generated by various users; this helps 
them avoid submitting to users who tend to give inappropriately low ratings. 
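
The disclosure does not say which Bayesian estimator is used for the expected rating. One common choice, shown here purely as an assumption, shrinks the observed average toward a prior mean so that an artist with only a few ratings is not judged too harshly or too favorably. The name `bayesian_expected_rating` and the parameters `prior_mean` and `prior_weight` are hypothetical.

```python
def bayesian_expected_rating(ratings, prior_mean, prior_weight=5.0):
    """Shrink the mean of `ratings` toward `prior_mean`.

    prior_weight acts like a number of imaginary ratings at prior_mean;
    with no real ratings the estimate equals prior_mean.
    """
    n = len(ratings)
    return (prior_mean * prior_weight + sum(ratings)) / (prior_weight + n)
```

As more appropriate submission ratings accumulate for an artist, the estimate moves from the prior mean toward the simple average.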

In some embodiments, there is a new songs list. This list simply lists songs that are relatively new, 
so that people who are interested in listening to new material can do so. This feature may appear 
in embodiments which do not contain any features for reimbursing those who listen to songs. In 
some embodiments where appropriate submission ratings are included, the songs may be listed in 
order of the measured appropriateness of the artist's past submissions. In further embodiments, 
artists with the worst appropriateness measures may not be allowed to submit at all. Also, in 
some embodiments, artists who have submitted a certain number of songs in the past must have 
achieved a certain measured popularity if they are to be able to continue submitting. For 
instance, the average number of playings per week of the artist's past submissions can be used; if 
it is below a certain point, no further submissions need be allowed. These calculations can be 
conducted globally or within the cluster membership. In order to keep this list from becoming 
too crowded, various means are used, such as always including songs for a limited, fixed period of 
time. 



Importance of the Administrator 

In some embodiments, the administrator plays a role much like that of a radio "DJ." The 
administrator, sometimes called a "guide" in such embodiments, plays a role in which his own 
personality and tastes are given high visibility. For instance, in some such embodiments, the 
administrator of a cluster is the only person who is enabled to provide ratings and reviews which 
are visible to visitors and members of the cluster. In such embodiments, administrators of 
different clusters compete with each other for the reputation of being the best and most reliable 
raters and reviewers; reliability is measured as discussed elsewhere. In further embodiments, 
non-administrators can provide ratings and reviews, but these are given subordinate visibility to 
those generated by the administrator. 


System Environment 

In various embodiments, the system runs on the World-Wide-Web, client-server systems based 
on the TCP/IP or other communications protocols, as a multi-user program accessed by users 
through terminal emulators, or other technical means. In all embodiments, one or more CPUs 
run the system, and users are enabled to access it from remote sites through an appropriate means 
of communication. 

Glossary: 



Item: An article of the subject matter covered by a particular system. In various 
embodiments, an item can be a song, an album, a recording artist, a book, an author, a video, 
a director, an actor or actress, a painting, etc. 

User: A person accessing the system. 

Artist: Creator of items. For instance, the artist Herman Melville created the item "Moby 
Dick." 

Cluster: A cluster is primarily defined by its taste. In various embodiments, clusters have 
associated facilities such as chat rooms, discussion groups, item purchase facilities, etc. 

Cluster Visitor: A user who is using the facilities of a cluster but who has not been registered 
with the cluster as a member. 

Cluster Member: A member has registered by indicating that he wants to join the cluster. In 
some embodiments, his taste is used in refining the taste of the cluster. In various 
embodiments members have special rights, such as the right to post to a cluster discussion 
group or the right to take special discounts when making purchases. 

Cluster Administrator: The person or group of people who (in some embodiments) defines 
the taste of the cluster, moderates chat and discussion rooms, sends notices of events, etc. In 
some further embodiments, the taste defined by the administrator is further refined by 
members and/or visitors. 

Taste of the cluster: In some embodiments, defined by the cluster administrator. In other 
embodiments, it is specified only by members by such means as averaging ratings for various 
items in the subject domain; in still other embodiments tastes specified by the administrator 
and members are combined to form the taste of the cluster. Tastes are specified and 
calculated as described in the text of this disclosure. 



Appendix A— Some mathematical aspects 

This appendix discusses aspects of the invention that relate to certain mathematical calculations. 

One problem being addressed is the fact that people can supply ratings that are essentially 
random (due to not making the effort to provide truly meaningful ratings), or which are 
consciously destructive or manipulative. For instance, it has been commented that on 
Amazon.com, every time a new book comes out, the first ratings and reviews are from the 
author's friends, which are then counteracted with contradictory reviews from his enemies. 

The key to solving this problem is to weight each user's ratings according to their reliability. For 
instance, if the author's friends and enemies are providing ratings simply to satisfy personal 
needs to help or hurt the author, it would be helpful if those ratings carried a lower weight than 
those of other users who have a past reputation for responsible, accurate ratings. 

A problem solved by this invention is to provide a way to calculate that past reputation. 

This reputation can be thought of as the expected "value to the system" of the user's ratings. This 
is bound up with the degree to which the user's ratings are representative of the real opinions of 
the population, particularly the population of clusters which are more appreciative of the genre 
into which the particular artist's work fits. 

(To measure the user's overall contribution to the system, we can multiply the expected value of 
his ratings by the number of his ratings. Users who contribute a large number of valuable 
[representative] ratings are, in some embodiments, rewarded with a high profile such as presence 
on a list of people who are especially reliable raters.) 

One can measure the representativeness of a user's ratings by calculating the correlation between 
those ratings and the average ratings of the larger population. 

This analysis of measuring the representativeness of a user's ratings has a major limitation, 
however. It doesn't take into account the fact that a rating has much more value if it is the first 
rating on an item than if it is the 100th. The first rating will provide real guidance to those who 
are wondering whether to download or buy a recording before other ratings have been entered; 
the 100th rating will not change people's actions in a major way. So early ratings add much more 
actual value to the community. Also, later raters might choose to simply copy earlier raters, so 
they can mislead any correlation calculations that way. 

Therefore, we want to weight earlier ratings more than later ones. The question is, how much 
5 more valuable is the T' rating than the second one, and the 2"" one more than the 3"^, etc.? 

Let $S$ be the set of all items; let $N$ be the number of all items; for $s \in S$ and
$0 < i \le N$, $s_i$ is the $i$th item. Let $u$ be the user whose rating representativeness we
wish to compute.

Let $g_{i,u}$ be the number of ratings received by $s_i$ previous to $u$'s rating (i.e., if $u$
gives the first rating for item $s_i$, $g_{i,u}$ is 0). Let $t_i$ be the total number of ratings
for the $i$th item.

Let $r_{i,u}$ be $u$'s rating of the $i$th item, normalized to the unit interval. Let $a_i$ be the
average of the ratings for the $i$th item other than $u$'s, also normalized to the unit interval.

Let $\lambda_1$ and $\lambda_2$ be constants.

Let $q_u$ be the representativeness of $u$'s ratings, calculated as follows:

$$q_u = \frac{\sum_{i=1}^{N} e^{-\lambda_1 g_{i,u}} \left(1 - e^{-\lambda_2 t_i}\right) \left(1 - \left|r_{i,u} - a_i\right|\right)}{\sum_{i=1}^{N} e^{-\lambda_1 g_{i,u}} \left(1 - e^{-\lambda_2 t_i}\right)}$$

Then $q_u$ is a number on the unit interval which is close to 1 if $u$'s ratings have tended to be
predictive of those of the community as a whole, and 0 if not.
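
The weighted calculation can be sketched in Python as follows. The weight
$e^{-\lambda_1 g}(1 - e^{-\lambda_2 t})$ follows the description of the two parameters in this
section; the agreement term $1 - |r - a|$, the function name, and the tuple layout are
assumptions made for illustration, not a definitive statement of the calculation.

```python
import math

def representativeness(ratings, lambda1, lambda2):
    """Weighted agreement between a user's ratings and the community averages.

    ratings: list of (g, t, r, a) tuples, one per item, where
      g = number of ratings that preceded the user's rating,
      t = total number of ratings for the item,
      r = the user's rating, normalized to [0, 1],
      a = average of the other users' ratings, normalized to [0, 1].
    """
    numerator = 0.0
    denominator = 0.0
    for g, t, r, a in ratings:
        # Early ratings (small g) on well-rated items (large t) weigh most.
        weight = math.exp(-lambda1 * g) * (1.0 - math.exp(-lambda2 * t))
        numerator += weight * (1.0 - abs(r - a))  # assumed agreement term
        denominator += weight
    return numerator / denominator if denominator > 0 else 0.0

# Hypothetical example: an early rating that agrees, a late one that does not.
print(representativeness([(0, 40, 0.8, 0.75), (60, 80, 0.2, 0.9)], 0.1, 0.05))
```

Because the second rating arrived after 60 others, its disagreement barely lowers the result.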

$\lambda_1$ and $\lambda_2$ are tuned for performance. $\lambda_1$ is a parameter of the
cumulative exponential distribution determining the rate of "drop-off" associated with the
importance of a rating as more ratings for a given item precede $u$'s rating. $\lambda_2$ is a
parameter of the cumulative exponential distribution determining the rate at which the drop-off
is associated with the number of total ratings. For instance, if there are no ratings for an item
other than $u$'s, the rating has no importance in calculating representativeness and is therefore
given weight 0.

These parameters can be set manually by intuitive understanding of the effect they have on the
calculation. In some embodiments they are set by setting up a training situation in which a
number of users rate the items without the means to see other people's ratings; furthermore,
these users are selected and given financial or other motivation for putting the effort in to input
the most accurate ratings they can generate. These controlled ratings are averaged. Then standard
computer optimization techniques such as simulated annealing or genetic algorithms are used to
determine values for $\lambda_1$ and $\lambda_2$ that optimize the correlation between these
averages and $q_u$, where $q_u$ is calculated using the entire population of users in usual
viewing mode (such that they could see the ratings of other users).

In preferred embodiments, tuning activities are carried out within the memberships of individual
clusters. That is, the controlled ratings given by members of a cluster are used to tune the
parameters relative to the general ratings given by other members of the same cluster. This is
carried out for each cluster. If it is deemed that there aren't enough members of some clusters to
effectively tune the parameters separately for each cluster, then in such cases the values for
$\lambda_1$ and $\lambda_2$ are averaged across all clusters, and clusters without enough
members can use those averaged values. In addition, if a given user has created ratings in
multiple clusters, some embodiments simply use the average of his representativeness numbers
for all clusters as his single viewable representativeness, and some embodiments display separate
representativeness numbers depending on the cluster in which the numbers are being viewed.

The representativeness of a user is then used for various purposes in various embodiments. In
some embodiments, it is presented to artists as a reason to pay a particular user to provide
ratings and reviews for new items. In further embodiments, it is used as a weight for the user's
ratings when calculating overall average ratings for an item. In some embodiments, listings are
provided showing the users' rankings as trustworthy raters, giving "ego gratification"; in most
such embodiments these numbers are also available when viewing the user's profile, along with
other information presented about the user.

It should not be construed that this invention is dependent upon the particular calculation method
for representativeness which is described above.
For example, another embodiment uses the following algorithm for computing the
representativeness $p_u$ of user $u$:

Calculate the average rating for each item, not counting $u$'s rating. For each item, rank the
population of ratings in order of their distance from the average rating. In embodiments where
discrete ratings are used (that is, some small number of rating levels such as "Excellent" to
"Poor" rather than a continuous scale), there will be ties. Simply give each rating a random rank
to eliminate ties. For instance, if the average rating is 3, and the ratings in order of their distance
from the average are 3, 3, 4, 2, 5, 5, 1, then after randomization one of the 3's, randomly chosen,
will have the top rank, the other will have the next highest rank, the 4 will have the third highest
rank, etc.

Call the distance from the average, based on these ranks, the "discrete closeness." Label the
ranks such that the closest rating has rank 0, the next closest 1, etc., up to $N - 1$, where $N$ is
the total number of ratings of the item. Now pick a random number on the interval $(0, 1]$. Add it
to the discrete closeness. Call this quantity the "real closeness" of user $u$ to the average for the
$i$th item and label it $p_{i,u}$. If user $u$'s ratings are randomly distributed with respect to the
average rating for each item, then the population of $p_{i,u}$'s, each scaled by the number of
ratings for its item so that it falls on the unit interval, has a uniform distribution on the unit
interval.

It can be shown that, due to this, the quantity $x_u = -2 \sum_{i=1}^{N} \log(1 - p_{i,u})$ has a
chi-square distribution with $2N$ degrees of freedom (here $N$ is the number of items, as in the
earlier definitions). A chi-square table can then be used to look up a p-value, $p_u'$, relative to
a given value of $x_u$. The quantity $p_u = 1 - p_u'$ is also a p-value and has a very useful
meaning. It approaches 0 when the distances between $u$'s ratings and the averages are
consistently close to 0, "consistently" being the key word. Also, as $N$ increases, $p_u$ becomes
still closer to 0. It represents the confidence with which we can reject the "null hypothesis" that
$u$'s ratings do not have an unusual tendency to agree with the average of the community. So
$p_u$ is an excellent indicator of the confidence we should have that user $u$ consistently agrees
with the ultimate judgement of the community (in most embodiments, this is the community
within a taste cluster).
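
The chi-square step can be sketched in Python without a table, because a chi-square
distribution with an even number of degrees of freedom has a closed-form tail probability. The
sketch assumes that the closeness values have already been scaled to the unit interval, and that
the looked-up p-value is the upper-tail probability of the statistic; the function names are
hypothetical.

```python
import math

def chi_square_upper_tail_even(x, n_items):
    """P(X > x) for a chi-square variable with 2 * n_items degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{k=0}^{n_items-1} (x/2)^k / k!.
    """
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, n_items):
        term *= half / k
        total += term
    return math.exp(-half) * total

def representativeness_p_value(closeness):
    """closeness: one value in [0, 1) per item; smaller means closer to average.

    Returns 1 minus the upper-tail probability, so values near 0 indicate
    ratings that are consistently close to the community average.
    """
    # Clamp to avoid log(0) if a closeness value is exactly 1.
    x = -2.0 * sum(math.log(max(1.0 - p, 1e-300)) for p in closeness)
    return 1.0 - chi_square_upper_tail_even(x, len(closeness))

print(representativeness_p_value([0.05] * 10))  # consistently close ratings
print(representativeness_p_value([0.9] * 10))   # consistently distant ratings
```

With a single item the result equals the closeness value itself, which is a quick sanity check
on the direction of the statistic.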
Preferred embodiments using the chi-square approach also include weights relative to
how early $u$ was in rating each item and to take into account the number of ratings for
each item. Let $w_{i,u} = e^{-\lambda_1 g_{i,u}} \left(1 - e^{-\lambda_2 t_i}\right)$, where
$g_{i,u}$ and $t_i$ are defined as before. Let [equation defining $y_u$ illegible in the source].

Then

$p_u' = \mathrm{Prob}\{y_u < \ldots\} = \sum \ldots$ [remainder of the equation and its "where" clause illegible in the source].

We use $p_u = 1 - p_u'$ as the measure of representativeness, with numbers closer to 0 being
better, as before.

Finally, further embodiments provide weights for one or both of the terms in the expression for
$w_{i,u}$. Proper weights can be found using the same procedures as are used for finding
$\lambda_1$ and $\lambda_2$; using genetic algorithms and other optimization techniques, in some
embodiments all these weights are found at the same time.

In general, in various preferred embodiments of the invention, various algorithms that allow a 
representativeness number to be calculated which includes the predictive nature of the user's 
ratings are used, so the invention as a whole has no dependency on any particular method. 

When displaying the quantities calculated as the representativeness numbers, preferred 
embodiments calculate rankings of the various users with respect to those numbers, or percentile 
rankings, or some other simplifying number, since the representativeness numbers themselves 
are not intuitively comprehensible to most users. 

Another useful feature emerges if we take $g_{i,u}$ to be a measure of elapsed time in days
between the public release of an item and the time the user rated it (which can be 0 if the review
preceded or coincided with the public release), and $\lambda_2 = \infty$. Then the approaches
mentioned above for calculating representativeness can be extended to such situations as
measuring the value of a user in predicting the overall long-term sales of particular items (or
even to predicting stock market prices and movements and other similar applications).

For instance, in some embodiments, a correspondence is made between ratings and ultimate sales 
volumes. In one such embodiment, the following algorithm is executed. For each rating level, all 
items with that average rating (when rounded) are located which have been on sale for a year or 
longer. Then, within each cluster, average sales volumes for each rating level's items are 
calculated. Then this correspondence is used to assign "sales ratings" to each item based on the 
total sales of that particular item; the actual sales are matched to the closest of the rating- 
associated levels of average sales, and the corresponding rating is used as the sales rating. (If 
there hasn't yet been enough activity in a particular cluster to conduct this exercise meaningfully, 
system-wide averages are used.) 
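
The correspondence between ratings and sales can be sketched in Python as follows. The function
names and sample figures are hypothetical, and the sketch handles a single cluster; note that
Python's built-in `round` rounds exact halves to the nearest even integer, which is one possible
reading of "when rounded."

```python
def average_sales_by_rating(items):
    """items: (average_rating, sales_volume) pairs for items on sale a year or longer.

    Returns a dict mapping each rounded rating level to its average sales volume.
    """
    totals = {}
    for rating, sales in items:
        level = round(rating)
        count, volume = totals.get(level, (0, 0.0))
        totals[level] = (count + 1, volume + sales)
    return {level: volume / count for level, (count, volume) in totals.items()}

def sales_rating(sales, level_averages):
    """Return the rating level whose average sales volume is closest to `sales`."""
    return min(sorted(level_averages),
               key=lambda level: abs(level_averages[level] - sales))

# Hypothetical history for one cluster.
history = [(1.2, 100), (0.9, 300), (3.0, 1000), (4.8, 5000), (5.1, 7000)]
levels = average_sales_by_rating(history)
print(sales_rating(900, levels))
```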

In this embodiment $p_{i,u}$ is computed using rankings of distances from the sales rating rather
than from the average rating. Then $\lambda_2$ is set to $\infty$ (in other words, the
$\left(1 - e^{-\lambda_2 t_i}\right)$ term is set to 1). Then we calculate the representativeness,
$p_u$, as before.

As with the case of calculating representativeness with respect to general ratings, it should not be 
construed that this invention is dependent upon the specific calculations given here for 
calculating a user's ratings' representativeness with respect to sales; other calculations which 
accept equivalent information, including the user's ratings, the sales volumes, and time data for 
ratings and sales (or, equivalently, elapsed time data), outputting a representativeness which 
involves a predictive component, will also serve the purpose of providing equivalent means for 
use by the invention overall. 

For instance, in some embodiments, a rank-based technique is used for calculating 
representativeness. In one such embodiment, time data is used to determine the items that the 
user rated soon after their release (or at or before their release) and that have now been on the 
market long enough to meaningfully measure sales volumes. These items are used to perform
Spearman rank correlation between the user's ratings and general ratings or sales volume; other
items are ignored. Other embodiments perform rank correlation based on this restricted sample 
and separately perform rank correlation upon all items rated by the user, and perform a weighted 
average on the results. 
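
A minimal Python sketch of the rank-correlation step is shown below. It assumes that tied values
share the average of their ranks, which is a common convention but is not specified in the text;
the function names and the sample data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Hypothetical early ratings by one user and the eventual sales of the same items.
early_ratings = [5, 2, 4, 1, 3]
sales_volumes = [9000, 1200, 4000, 1500, 2500]
print(round(spearman(early_ratings, sales_volumes), 3))
```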

Note 1: In some embodiments, it is possible for a user to change his review and rating of an item
over time, since he may come to feel differently about it with more experience. But for purposes
of calculating representativeness, his earlier ratings are stored. In preferred such embodiments,
the last rating of an item entered on the first day that he rated that item is used.

Note 2: In cases where the cluster has too few ratings or sales to do meaningful calculations,
"virtual" clusters can be created by combining clusters with similar taste signatures into one
larger cluster for the purpose of computing representativeness. In preferred such embodiments,
clusters are successively added to the original cluster, and the representativeness recalculated as
long as the representativeness number continues to rise with each iteration. When it declines, this
process ends. The maximum representativeness number obtained in this way is the one assigned
to the user.

Note 3: In various embodiments the discussed calculations are conducted at either the "artist
level" or "item level". That is, in some embodiments the artists are rated and calculations done
from those ratings and in others item ratings are used.



Appendix B 

Introduction

This brief document presents a methodology for clustering songs by calculating "information
transfer" as that value is calculated within the framework of Shannon entropy.
First, we will present a simple clustering algorithm, and second, we will
present Python source code for calculating information transfer between clusters and users.
Together, these techniques comprise a complete solution for clustering songs.



Clustering Algorithm

For simplicity, and to maximize the probability of showing that our basic approach can find
useful clusters, we will use one of the simplest clustering algorithms possible, which does not
contain possible optimizations to improve computational speed.

Here are the steps:

    For each song not yet assigned a cluster (note that the first time the system
    is started, this would be all songs):
        Randomly assign a cluster.
    Repeat:
        For each song (including new songs not yet added to clusters):
            For each cluster other than the original one the song is in:
                Compute the change in total system information transfer that
                would occur if the song were moved to the other cluster.
            If at least one such potential move would result in an
            increase of information transfer:
                Execute the move that results in the greatest
                increase.
        If no movements occurred in the "For each song" loop:
            Delay until there is a new song to process.



The above can continue until we want to bring down the system. 
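
The steps above can be sketched in modern Python as follows. This is an unoptimized
illustration under stated assumptions, not the production design: the function names, the
`song_users` mapping (song to the set of users associated with it), and the decision to stop
when a full pass moves no song (rather than waiting for new songs) are all hypothetical.
Information transfer is recomputed from scratch for every trial move.

```python
import math
import random

def information_transfer(assignment, song_users):
    """Information transfer (in nats) between clusters and users."""
    joint = {}
    for song, users in song_users.items():
        cluster = assignment[song]
        for user in users:  # each user-song association counts as 1
            joint[(cluster, user)] = joint.get((cluster, user), 0) + 1
    total = sum(joint.values())
    if total == 0:
        return 0.0
    by_cluster, by_user = {}, {}
    for (cluster, user), n in joint.items():
        by_cluster[cluster] = by_cluster.get(cluster, 0) + n
        by_user[user] = by_user.get(user, 0) + n

    def entropy(counts):
        return -sum((n / total) * math.log(n / total) for n in counts if n)

    return (entropy(by_user.values()) + entropy(by_cluster.values())
            - entropy(joint.values()))

def cluster_songs(song_users, n_clusters, seed=0):
    """Greedy clustering: move each song to the cluster that most raises transfer."""
    rng = random.Random(seed)  # constant seed gives repeatable test runs
    assignment = {song: rng.randrange(n_clusters) for song in sorted(song_users)}
    moved = True
    while moved:
        moved = False
        for song in sorted(song_users):
            original = assignment[song]
            best_cluster = original
            best_value = information_transfer(assignment, song_users)
            for candidate in range(n_clusters):
                if candidate == original:
                    continue
                assignment[song] = candidate  # try the move
                value = information_transfer(assignment, song_users)
                if value > best_value + 1e-12:  # ignore floating-point noise
                    best_cluster, best_value = candidate, value
            assignment[song] = best_cluster
            if best_cluster != original:
                moved = True
    return assignment

# Hypothetical data: two pairs of songs with disjoint audiences.
plays = {"a": {"u1", "u2"}, "b": {"u1", "u2"}, "c": {"u3", "u4"}, "d": {"u3", "u4"}}
print(cluster_songs(plays, 2))
```

Every accepted move strictly increases information transfer, so the loop terminates once no
single-song move helps; like any greedy method, it may stop at a local optimum on larger data.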

It would be great if an administrator console could see, via the Web, a history of the number of
distinct songs moved per hour, so that we can monitor how the system is evolving toward
stability. If no songs were moved in recent hours, we know that optimization is complete (of
course that will only happen if we stop adding songs).

Calculating Information Transfer

A Python example will be used to describe the algorithm. 

At the top of the Python listing is a matrix. Each row represents a cluster and each column
represents a user. The numbers represent the number of songs in the ith cluster that are associated
with the jth user. For example, the 10th user is associated with 3 songs in the 4th cluster. With
Radio Userland data, this would mean that the user has played the song.

When a song is moved from one cluster to another, a number of counts in the matrix may be
affected, both in the originating cluster and the target cluster, because that song will be
associated with a number of users. Consequently, the clustering algorithm, which must "try"
various possible movements to find the best one, will be very computationally expensive.
Various tricks can be used to minimize the number of computations to be done; the Python code
below uses virtually no such tricks. It would be appropriate for early Java versions to be equally
free of optimizations; for one thing, the fewer optimizations, the less chance for bugs to be
introduced into the code. Then we can refine from there, always checking to make sure our
optimizations don't change the output numbers. We can check this by loading the database with
test data, setting the random number generator to a constant seed, and running the algorithm after
each enhancement. The resulting clusterings should always be identical after the same number of
iterations. NOTE: There should therefore be some easy way to load the same test data into the
system repeatedly.

Obviously, a line-by-line conversion to Java probably doesn't make sense. For one thing, an
index-based data structure will probably not be appropriate, because the IDs of the users, after
filtering, will not be contiguous. And some users may be dropped from the processing over time
for one reason or another. So some kind of map structure would seem to be more appropriate.
The row-and-column naming convention would therefore probably also not make sense in the
Java version.

Note 1: In the initial release, let's count all user-song associations as being a 1 no matter how
many times the user played the song. So, to get a count of 3 in a cell in the matrix, a user must
have played 3 distinct songs. Future versions may count each play as a separate association.

Note 2: It is traditional to use log base 2 when doing Shannon entropy calculations, but if there is
no log base 2 function in the Java libraries, we can use natural logarithms.



PYTHON CODE BEGINS HERE

# Each row is a cluster and each column is a user. Every row is cut off at the
# page edge in the source after its 21st entry; only the visible entries appear here.
lstLst = [
    [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
    [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
    [ 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
    [ 0, 0, 0, 0, 0, 0, 0, 1, 9, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
    [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 ],
    [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0 ],
    [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1 ],
]

import math

def minusPLogP( float_ ):
    """
    Takes a probability associated with a particular value for a
    random variable, and outputs that value's contribution to the
    Shannon entropy.
    """
    if float_ == 0.0:
        return 0
    else:
        return - float_ * math.log( float_ )

def countCol():
    """
    The number of users.
    """
    return len( lstLst[ 0 ] )

def countRow():
    """
    Return the number of clusters.
    """
    return len( lstLst )

def sumRow( int_row ):
    """
    Sums user-song-association instances for a single cluster.
    """
    int_sum = 0
    for int_col in range( countCol() ):
        int_sum = int_sum + lstLst[ int_row ][ int_col ]
    return int_sum

def sumCol( int_col ):
    """
    Sums user-song-association instances for a single user.
    """
    int_sum = 0
    for int_row in range( countRow() ):
        int_sum = int_sum + lstLst[ int_row ][ int_col ]
    return int_sum

def sumTotal():
    """
    Sums user-song-association instances across the universe.
    """
    int_sum = 0
    for int_row in range( countRow() ):
        for int_col in range( countCol() ):
            int_sum = int_sum + lstLst[ int_row ][ int_col ]
    return int_sum

def userUncertainty():
    """
    Loop through the users, calculating the probability,
    p, that a randomly
    chosen user-song-association instance would be associated
    with the user being looped through.
    Then sum -p log p for all users.
    That is the Shannon uncertainty for the user population.
    """
    float_sum = 0.0
    int_total = sumTotal()
    for int_col in range( countCol() ):
        float_p = float( sumCol( int_col ) ) / int_total
        float_sum = float_sum + minusPLogP( float_p )
    return float_sum

def clusterUncertainty():
    """
    Loop through the clusters, calculating the probability, p,
    that a randomly
    chosen user-song-association instance would be associated
    with the cluster being looped through.
    Then sum -p log p for all clusters.
    That is the Shannon uncertainty for the cluster population.
    """
    float_sum = 0.0
    int_total = sumTotal()
    for int_row in range( countRow() ):
        float_p = float( sumRow( int_row ) ) / int_total
        float_sum = float_sum + minusPLogP( float_p )
    return float_sum

def jointUncertainty():
    """
    Loop through all unique combinations of user-cluster,
    calculating the probability, p, that a randomly
    chosen user-song-association instance would be associated
    with the user-cluster combination being looped through.
    Then sum -p log p for all user-cluster combinations.
    That is the joint Shannon uncertainty for the cluster population.
    """
    float_sum = 0.0
    int_total = sumTotal()
    for int_row in range( countRow() ):
        for int_col in range( countCol() ):
            float_p = float( lstLst[ int_row ][ int_col ] ) / int_total
            float_sum = float_sum + minusPLogP( float_p )
    return float_sum

def calculateInformationTransfer():
    """
    Calculate the information transfer.
    """
    return userUncertainty() + clusterUncertainty() - jointUncertainty()

# Single-argument print calls run unchanged under both Python 2 and Python 3.
print( 'User uncertainty: %s' % userUncertainty() )
print( 'Cluster uncertainty: %s' % clusterUncertainty() )
print( 'Joint uncertainty: %s' % jointUncertainty() )
print( 'Information transfer: %s' % calculateInformationTransfer() )

An Optimization Strategy 

This strategy is used by preferred embodiments of the information transfer algorithm.
Add new user-song associations in batches, allowing a significant period of time between each
batch.

Since the total that is in the denominator of all the p calculations will not change in between
batches, it is possible, at the end of a batch load, to create a one-dimensional array to
represent the p log p values, where the index is the numerator in the p calculation. Thus, each
relevant p log p calculation only needs to be performed once, and is then reused.

Instead of actually re-allocating memory for the array at the end of each batch load, the array can
be zeroed out. A 0 in an element indicates that p log p has not yet been calculated. So, when a
value is needed for p log p, the appropriate element is checked, and if it is 0, it is calculated. If it
is non-zero, then the value that is there is used.
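
A minimal Python sketch of this lookup table follows. The class and method names are
hypothetical. The sketch departs from the description in two labeled ways: it uses `None`
rather than 0 as the "not yet calculated" marker, so that genuine zero results (a numerator of 0,
or a numerator equal to the total) are also cached, and for brevity it allocates a fresh list on
each reset instead of clearing the old one in place.

```python
import math

class PLogPCache:
    """Caches -p log p by numerator while the denominator (total) is fixed."""

    def __init__(self, total):
        self.reset(total)

    def reset(self, total):
        """Call at the end of each batch load, when the total has changed."""
        self.total = total
        # One slot per possible numerator 0..total; None means "not computed".
        self.values = [None] * (total + 1)

    def minus_p_log_p(self, count):
        cached = self.values[count]
        if cached is None:
            if count == 0:
                cached = 0.0  # the limit of -p log p as p approaches 0
            else:
                p = count / self.total
                cached = -p * math.log(p)
            self.values[count] = cached
        return cached

cache = PLogPCache(total=40)
print(cache.minus_p_log_p(10))  # computed on first use
print(cache.minus_p_log_p(10))  # served from the table
```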

Appendix C 

#VERSION 12 08/27/00

#Copyright (c) 2000 by Virtual Development Corp. All Rights Reserved.

#Usage Notes##########################
# MinimumConvergenceIterations in the Config file must be at least 1. (See BUGS.)
# MinimumConvergenceIterations "beats" MaxTime. It will run for the minimum
# number of configurations, then run until MaxTime.

# work_ = Work instance
# rel_ = Relatable instance
# clus_ = Cluster
# clst_ = ClusterSet
# clss_ = ClusterSetSignature

import whrandom
import math
import xmllib
import copy
import time
import ConfigParser
import urllib
import sys



# utility stuff

G_generator = whrandom.whrandom() # For why global, see http://starship.python.net/crew/donp/script/sample.py
#G_generator.seed(1,1,3)

def shuffle(sample_size): # See http://starship.python.net/crew/donp/script/sample.py
    '''Moses and Oakford algorithm. See Knuth, vol 2, section 3.4.2.
    Returns a random permutation of the integers from 1 to
    sample_size.
    '''
    assert type(sample_size) == type(0) and sample_size > 0
    global G_generator
    list = range(1, sample_size + 1)
    for ix in xrange(sample_size - 1, 0, -1):
        rand_int = G_generator.randint(0, ix)
        if rand_int == ix:
            continue
        tmp = list[ix]
        list[ix] = list[rand_int]
        list[rand_int] = tmp
    return list

# from http://starship.python.net/pipermail/python-de/1997q1/000026.html
# "Converter module from strings to HTML entities"
# The code is modified slightly to use the encodings
# the python xml parser defaults to decoding, rather than using
# htmlentitydefs.

EntitiesByOrd = { ord( '<' ) : 'lt',
                  ord( '>' ) : 'gt',
                  ord( '&' ) : 'amp',
                  ord( '"' ) : 'quot',
                  ord( "'" ) : 'apos' }

def toXML(s):
    pos=start=0
    result=""
    flush=0
    while pos<len(s):
        c=ord(s[pos])
        if EntitiesByOrd.has_key(c):
            flush=1
            item="&"+EntitiesByOrd[c]+";"
        if flush:
            result=result+s[start:pos]+item
            start=pos+1
            flush=0
        pos=pos+1
    result=result+s[start:pos]
    return result

def computeEvenRankUnitRanks( lstTup_input ):
    # SHOULD BE IN DATA object
    # Suppose 100 values are tied for second place, and 1
    # is alone in first. It should not be assumed that we
    # should put the lone value in the top percentile, because
    # it could easily be due to noise. So, we compromise by
    # saying there are 2 ranks, and we assign .25 to everyone in the low
    # and .75 to the one in the high.
    # We only use the first element in the tuple for ranking.
    # Output list has the same data as the input, but in
    # rank order, and each tuple has two extra elements
    # at the end: the integer rank (ties are counted as
    # the same rank; best is highest) and the unit rank.
    # FURTHER ADJUSTMENT DURING TIME OF LITTLE DATA!!! If
    # there are two input sort field values, 1 and 2, the
    # original algorithm gives outputs .25 and .75. But that
    # still means that the low level is much closer to 0
    # than the high level is. That makes no sense.
    # So, we change the levels to .625 and .875.

    lstTup_input.sort()
    assert lstTup_input[ 0 ][ 0 ] != None # logic assumes first sort value is not None
    lstTup_intermediate = []
    int_rank = 0
    any_previousSortValue = None
    for tup_ in lstTup_input:
        if any_previousSortValue != tup_[ 0 ]:
            int_rank = int_rank + 1
            any_previousSortValue = tup_[ 0 ]
        lstTup_intermediate.append( tup_ + ( int_rank, ) )

    lstTup_output = []
    for tup_ in lstTup_intermediate:
        float_ = ( tup_[ -1 ] - 0.5 ) / float( int_rank )
        float_tuning = Config.float_tuningRankBottom + float_ * ( 1.0 -
            Config.float_tuningRankBottom ) #see note above for little data
        lstTup_output.append( tup_ + ( float_tuning, ) )
    return lstTup_output

def computeAverageUnitRanks( lstTup_input ):
    # NOT USED IN CURRENT CODE 8/24/00
    # The first element in the tuple is the only one used
    # in the ranking.
    # The output list contains tuples identical to the input
    # list but with an added element at the end, which is
    # the ranking, with dups assigned to the average ranks
    # of the dups.

    def isLastInDupSet( int_index, lstTup_ ):
        if len( lstTup_ ) == int_index + 1:
            return 1
        else:
            if lstTup_[ int_index ][ 0 ] != lstTup_[ int_index + 1 ][ 0 ]:
                return 1
            else:
                return 0

    float_offset = 1.0 / ( 2.0 * len( lstTup_input ))
    lstTup_input.sort()
    lstTup_output = []
    int_startDupIndex = 0
    int_limitIndex = len( lstTup_input )
    lst_currentDupSet = []
    for int_index in range( int_limitIndex ):
        if isLastInDupSet( int_index, lstTup_input ):
            lst_currentDupSet.append( lstTup_input[ int_index ] )
            # Compute average unit rank
            float_averageRank = ( int_index + int_startDupIndex ) / 2.0
            float_averageUnitRank = float_offset + float_averageRank / int_limitIndex
            # Add to output list
            for tup_ in lst_currentDupSet:
                lstTup_output.append( tup_ + ( float_averageUnitRank, ) )
            # Set the stage for next iteration
            int_startDupIndex = int_index + 1
            lst_currentDupSet = []
        else:
            lst_currentDupSet.append( lstTup_input[ int_index ] )
    return lstTup_output



# Classes
class Config:
    # When an instance is created, the class attributes are populated;
    # at that point, the instance itself can be thrown away.
    str_clusterFile = None
    str_useFile = None
    str_oldUseFile = None
    int_createClusterCount = None
    float_maxTime = None
    int_minimumConvergenceIterations = None
    str_outClusterFile = None
    float_tuningRankBottom = None
    float_tuningZeroWeight = None

    C_str_configFile = 'clusterconfig.txt'
    C_str_sectionName = 'Configuration'
    C_str_clusterFile = 'InClusterFile'
    C_str_useFile = 'UseFile'
    C_str_oldUseFile = 'OldUseFile'
    C_str_createClusterCount = 'CreateClusterCount'
    C_str_maxTime = 'MaxTime'
    C_str_minimumConvergenceIterations = 'MinimumConvergenceIterations'
    C_str_outClusterFile = 'OutClusterFile'
    C_str_tuningRankBottom = 'TuningRankBottom'
    C_str_tuningZeroWeight = 'TuningZeroWeight'

    def __init__( self ):
        configParser = ConfigParser.ConfigParser()
        configParser.read( Config.C_str_configFile )
        Config.str_clusterFile = configParser.get( Config.C_str_sectionName,
            Config.C_str_clusterFile )
        Config.str_useFile = configParser.get( Config.C_str_sectionName,
            Config.C_str_useFile )
        Config.str_oldUseFile = configParser.get( Config.C_str_sectionName,
            Config.C_str_oldUseFile )
        Config.int_createClusterCount = int( configParser.get( Config.C_str_sectionName,
            Config.C_str_createClusterCount ) )
        Config.float_maxTime = float( configParser.get( Config.C_str_sectionName,
            Config.C_str_maxTime ) )
        Config.int_minimumConvergenceIterations = int( configParser.get(
            Config.C_str_sectionName, Config.C_str_minimumConvergenceIterations ) )
        Config.float_tuningRankBottom = float( configParser.get( Config.C_str_sectionName,
            Config.C_str_tuningRankBottom ) )
        Config.float_tuningZeroWeight = float( configParser.get( Config.C_str_sectionName,
            Config.C_str_tuningZeroWeight ) )
        Config.str_outClusterFile = configParser.get( Config.C_str_sectionName,
            Config.C_str_outClusterFile )

class Data:
    # This is a singleton. One instance is created, and that creates everything.
    # "Longnames" are of the format "Beatles - Hey Jude". The artist and the title
    # are separated by spacedashspace. Each Work object is uniquely identified by a Longname.
    singleton = None

    def __init__( self ):
        assert not self.__class__.singleton
        self.__class__.singleton = self
        self.dictStrDictStrNone_userLongname = {}
        self.dictStrDictStrFloat_longname2Longname1UnitRank = {}
        self.dictLongnameWork_ = {}
        self.dictStrDictStrInt_longname1Longname2Count = {}
        self.dictStrInt_longnameUniqueCount = {}
        self.lstWork_ = []
        assert Config.str_useFile
        print 'about to read data'
        self.readUserPlayStats( Config.str_useFile )
        print 'about to generate use counts'
        self.generateUseCounts()
        print 'about to generate unit ranks'
        self.generateUnitRanks()

    def displayCheckingInfo( self ):
        dict_russians = self.dictStrDictStrFloat_longname2Longname1UnitRank[ 'Sting - Russians' ]
        lst_russians = dict_russians.items()
        lst_russians.sort()

    def getWorks( self ):
        return self.lstWork_

    def getUnitRanks( self ):
        assert self.dictStrDictStrFloat_longname2Longname1UnitRank
        return self.dictStrDictStrFloat_longname2Longname1UnitRank

    def getAssociatedLongnames( self, str_longname ):
        assert self.dictStrDictStrFloat_longname2Longname1UnitRank.has_key( str_longname )
        return self.dictStrDictStrFloat_longname2Longname1UnitRank[ str_longname ].keys()


    def readUserPlayStats( self, str_fileName ):
        if str_fileName[ :7 ] == "http://":
            fil_ = urllib.urlopen( str_fileName )
        else:
            fil_ = open( str_fileName, 'r' )
        str_ = fil_.read()
        fil_.close()

        class UseListContainerParser1( xmllib.XMLParser ): # Embedded class, only used here!

            # THIS LOGIC ASSUMES UNIQUENESS AT USER/SONG LEVEL IN THE INPUT XML FILE!!

            def __init__( self, data_ ):
                self.str_currentUser = None
                self.data_ = data_
                xmllib.XMLParser.__init__( self )

            def start_entry( self, dict_ ):
                # str_work is the title of the work, which must be distinguished from Work objects!
                if ( self.str_currentUser != 'mike3k@mail.com'
                     and self.str_currentUser != 'jake@jspace.org'
                     and self.str_currentUser != 'jake@braincase.net' ):
                    if int( dict_[ 'count' ] ) > 1:
                        str_artist = intern( dict_[ 'artist' ] )
                        str_work = intern( dict_[ 'work' ] )
                        str_longname = intern( '%s - %s' % ( str_artist, str_work ) )
                        dict_ = self.data_.dictStrInt_longnameUniqueCount
                        if dict_.has_key( str_longname ):
                            dict_[ str_longname ] = dict_[ str_longname ] + 1
                        else:
                            dict_[ str_longname ] = 1

            def start_useList( self, dict_ ):
                self.str_currentUser = dict_[ 'user' ]

        class UseListContainerParser2( xmllib.XMLParser ): # Embedded class, only used here!

            def __init__( self, data_ ):
                self.str_currentUser = None
                self.data_ = data_
                xmllib.XMLParser.__init__( self )

            def start_entry( self, dict_ ):
                # str_work is the title of the work, which must be distinguished from Work objects!
                str_artist = intern( dict_[ 'artist' ] )
                str_work = intern( dict_[ 'work' ] )
                str_longname = intern( '%s - %s' % ( str_artist, str_work ) )
                if ( self.data_.dictStrInt_longnameUniqueCount.has_key( str_longname ) and
                     self.data_.dictStrInt_longnameUniqueCount[ str_longname ] > 1 ):
                    if self.data_.dictStrDictStrNone_userLongname.has_key( self.str_currentUser ):
                        if self.data_.dictStrDictStrNone_userLongname[ self.str_currentUser ].has_key( str_longname ):
                            pass # Already there!
                        else:
                            self.data_.dictStrDictStrNone_userLongname[ self.str_currentUser ][ str_longname ] = None
                    else:
                        self.data_.dictStrDictStrNone_userLongname[ self.str_currentUser ] = { str_longname : None }

                    if not self.data_.dictLongnameWork_.has_key( str_longname ):
                        work_ = Work( str_longname, str_artist, str_work )
                        self.data_.lstWork_.append( work_ )
                        self.data_.dictLongnameWork_[ str_longname ] = work_

            def start_useList( self, dict_ ):
                self.str_currentUser = dict_[ 'user' ]

        parser_1 = UseListContainerParser1( self )
        parser_1.feed( str_ )
        parser_1.close()

        parser_2 = UseListContainerParser2( self )
        parser_2.feed( str_ )
        parser_2.close()

    def generateUseCounts( self ):
        dictStrDictStrInt_longname1Longname2Count = {}
        lstStr_user = self.dictStrDictStrNone_userLongname.keys()
        int_loopCount = 0
        for str_user in lstStr_user:
            int_loopCount = int_loopCount + 1
            int_innerLoopCount = 0
            sys.stdout.flush()
            for str_longname1 in self.dictStrDictStrNone_userLongname[ str_user ].keys():
                int_innerLoopCount = int_innerLoopCount + 1
                # print 'deep in loop', int_innerLoopCount, ' of ', len( self.dictStrDictStrNone_userLongname[ str_user ] )
                for str_longname2 in self.dictStrDictStrNone_userLongname[ str_user ].keys():
                    # if str_longname1 != str_longname2: songs played by only 1 user can still be clustered due
                    # to the user's other choices... not counting cases
                    # where the two are equal would eliminate them, and
                    # should cause logic that loops through all of the songs
                    # looking for unitRanks to fail
                    if str_longname1 != str_longname2:
                        if dictStrDictStrInt_longname1Longname2Count.has_key( str_longname1 ):
                            if dictStrDictStrInt_longname1Longname2Count[ str_longname1 ].has_key( str_longname2 ):
                                dictStrDictStrInt_longname1Longname2Count[ str_longname1 ][ str_longname2 ] = \
                                    dictStrDictStrInt_longname1Longname2Count[ str_longname1 ][ str_longname2 ] + 1
                            else:
                                dictStrDictStrInt_longname1Longname2Count[ str_longname1 ][ str_longname2 ] = 1
                        else:
                            dictStrDictStrInt_longname1Longname2Count[ str_longname1 ] = { str_longname2 : 1 }

        self.dictStrDictStrInt_longname1Longname2Count = dictStrDictStrInt_longname1Longname2Count



    def generateUnitRanks( self ):
        # "Unit ranks" are ranks scaled down to the unit interval. For instance, the lowest
        # rank out of 57 elements is 0, and the highest is 56/57 = .98245614035. But, we
        # also perform averaging, so ranks that extreme should be unusual.

        # Consider longname1 to be a work "associated" with longname2. Longname2 is the work
        # for which we are generating a profile; this profile involves the associated
        # Longname1 works.
        # That is, a profile for a longname2 would contain all
        # the longname1's that are associated with it. For each associated work, considered across all
        # main works, there is one rank for each main work;
        # that's where the uniform distribution comes from. The alternative would be: for each main work have
        # one rank for each associated work; then some associated works would NECESSARILY have very low rank.
        # In contrast, using the approach presented, all associated works CAN have high rank -- but under
        # the null hypothesis the distribution would be uniform.

        self.dictStrDictStrFloat_longname2Longname1UnitRank = {}
        for str_longname1 in self.dictStrDictStrInt_longname1Longname2Count.keys():
            lstTupIntStr_ = []
            dictStrInt_longname2Count = self.dictStrDictStrInt_longname1Longname2Count[ str_longname1 ]
            for str_longname2 in dictStrInt_longname2Count.keys():
                lstTupIntStr_.append( ( dictStrInt_longname2Count[ str_longname2 ], str_longname2 ) )
            if str_longname1 == 'Elton John - Levon':
                pass # Debugging hook; its body does not survive in this listing
            lstTupIntStr_.sort()
            lstTupIntStrIntFloat_ = computeEvenRankUnitRanks( lstTupIntStr_ )
            for int_ in range( len( lstTupIntStrIntFloat_ ) ):
                tupIntStrIntFloat_ = lstTupIntStrIntFloat_[ int_ ]
                float_ = tupIntStrIntFloat_[ -1 ]
                str_longname2 = lstTupIntStrIntFloat_[ int_ ][ 1 ]
                if self.dictStrDictStrFloat_longname2Longname1UnitRank.has_key( str_longname2 ):
                    self.dictStrDictStrFloat_longname2Longname1UnitRank[ str_longname2 ][ str_longname1 ] = float_
                else:
                    self.dictStrDictStrFloat_longname2Longname1UnitRank[ str_longname2 ] = { str_longname1 : float_ }

        # fil_.close()

    # computeAverageUnitRanks

class Relatable:

    def getName( self ):
        assert 0

    def getAssociatedRelatedness( self, str_otherName ):
        assert 0

    def getAssociatedLongnames( self ):
        assert 0

    def getOverallRelatedness( self, rel_ ):
        float_zeroWeight = Config.float_tuningZeroWeight
        float_sum = 0.0
        float_divisor = 0.0
        for str_name in self.getAssociatedLongnames():
            float_other = rel_.getAssociatedRelatedness( str_name )
            if float_other == None: # Defensive programming
                float_other = 0.0
            if float_other == 0:
                float_weight = float_zeroWeight
            else:
                float_weight = 1.0
            float_divisor = float_divisor + float_weight
            float_self = float( self.getAssociatedRelatedness( str_name ) ) # Cast is defensive programming
            float_product = float_self * float_other * float_weight
            float_sum = float_sum + float_product
        if float_divisor:
            float_overallRelatedness = float_sum / float_divisor
        else:
            float_overallRelatedness = 0.0
        return float_overallRelatedness

class Work( Relatable ):

    # The xml attribute 'work' is the title of the work, which must be distinguished from Work objects,
    # which contain artist info as well as title info!

    def __init__( self, str_longname, str_artist, str_work ):
        # The "Longname" of the work, for purposes of this program, is the artist + the work title.
        # Data.singleton.getAssociatedLongnames( str_longname )
        #   (left disabled: unit ranks do not exist yet when Works are created, so its assert would fail)
        self.str_longname = str_longname
        self.str_artist = str_artist
        self.str_work = str_work

    def getName( self ):
        return self.str_longname

    def getArtist( self ):
        return self.str_artist

    def getAssociatedRelatedness( self, str_longname ):
        dictStrDictStrFloat_longname2Longname1UnitRank = Data.singleton.getUnitRanks()
        dict_ = dictStrDictStrFloat_longname2Longname1UnitRank # Using intermediate name just for clarity
        assert dict_.has_key( self.str_longname )
        if dict_[ self.str_longname ].has_key( str_longname ):
            float_unitRank = dict_[ self.str_longname ][ str_longname ]
        else:
            float_unitRank = 0.0
        return float_unitRank

    def getAssociatedLongnames( self ):
        # Look up this work's own longname (the listing referenced an undefined local name here)
        return Data.singleton.getAssociatedLongnames( self.str_longname )

class Cluster( Relatable ):

    # To understand this class, it's important to understand the difference between a
    # cluster's membership list and its profile. Both of them involve a group of
    # objects subclassed from Relatable. But the membership list (self.lstRel_member)
    # determines the objects that are currently members of a cluster; whereas, the
    # profile (self.dictStrFloat_longnameRelatedness) is a description of the current
    # "center" of the cluster for purposes of measuring the distance between the
    # cluster and an object that is a candidate for membership in the cluster.
    # Normally, all candidate objects are assigned to a cluster before the profile
    # is computed; these assignments are based on the old profiles. For instance,
    # when clusters are being generated for the first time, the old profiles are
    # random. When clusters are being regenerated based on old clusters read from
    # an xml disk file, the profiles from the disk file clusters are used as the
    # old profiles.

    str_nextAutomaticName = '1'

    def __init__( self, str_name=None ):
        self.lstRel_member = []
        self.dictStrFloat_longnameRelatedness = {}
        if str_name:
            self.str_name = str_name
        else:
            int_ = int( self.__class__.str_nextAutomaticName )
            self.str_name = self.__class__.str_nextAutomaticName
            self.__class__.str_nextAutomaticName = str( int_ + 1 )

    def getName( self ):
        return self.str_name

    def getMembers( self ):
        return self.lstRel_member

    def getAssociatedRelatedness( self, str_longname ):
        # 1 or 0
        if self.dictStrFloat_longnameRelatedness.has_key( str_longname ):
            return self.dictStrFloat_longnameRelatedness[ str_longname ]
        else:
            return 0.0

    def getCountUniqueArtist( self ):
        if not self.lstRel_member:
            return 0
        assert self.lstRel_member[ 0 ].__class__ == Work
        dict_ = {}
        for work_ in self.lstRel_member:
            dict_[ work_.getArtist() ] = None
        return len( dict_ )

    def getAssociatedLongnames( self ):
        return self.dictStrFloat_longnameRelatedness.keys()

    def addToCluster( self, rel_ ):
        self.lstRel_member.append( rel_ )

    def addToProfile( self, strLongname ):
        # Used for initializing empty profile for later clustering.
        self.dictStrFloat_longnameRelatedness[ strLongname ] = None

    def computeClusterProfile( self, bool_binary ):
        # Normally, relatedness of each member to the cluster is binary --
        # 1 if it's in the dict, 0 otherwise. However, in the final
        # cluster convergence, it makes sense to do a 2-stage profile computation;
        # first we compute the binary values (represented by membership in
        # the dict vs. non-membership), then, using those values, we recompute
        # the profile, generating floating point values. This allows
        # us, in the final convergence, to generate clusters in such
        # a way that the most remote profile elements don't hold as great a sway
        # over what potential members are attracted to the cluster.

        # WHILE REVIEWING THIS CODE FOR WORK ON CLUSTERS13, I NOTICED THAT THIS
        # APPARENTLY SHOULD BE STRUCTURED AS: IF BOOL_BINARY ... ELSE. THIS WOULD
        # AVOID SETTING dictStrFloat_longnameRelatedness TWICE, AS APPARENTLY
        # HAPPENS WITH THE CURRENT CODE. NOT CHANGING NOW BECAUSE AM WORKING
        # ON NEW VERSION AND DO NOT EXPECT TO TEST CHANGES.

        for rel_ in self.lstRel_member:
            if rel_.__class__ == Work:
                self.dictStrFloat_longnameRelatedness[ rel_.getName() ] = 1.0
            elif rel_.__class__ == Cluster:
                lstStr_otherName = rel_.getAssociatedLongnames()
                for str_otherName in lstStr_otherName:
                    self.dictStrFloat_longnameRelatedness[ str_otherName ] = 1.0
            else:
                assert 0 # Attempt to cluster an illegal class

        if not bool_binary:
            for rel_ in self.lstRel_member:
                if rel_.__class__ == Work:
                    self.dictStrFloat_longnameRelatedness[ rel_.getName() ] = self.getOverallRelatedness( rel_ )
                elif rel_.__class__ == Cluster:
                    lstStr_otherName = rel_.getAssociatedLongnames()
                    for str_otherName in lstStr_otherName:
                        self.dictStrFloat_longnameRelatedness[ str_otherName ] = self.getOverallRelatedness( rel_ )
                else:
                    assert 0 # Attempt to cluster an illegal class

    def makeEmpty( self ):
        # Notice that it leaves the profile (self.dictStrFloat_longnameRelatedness) intact for purposes
        # of getAssociatedRelatedness() and getAssociatedLongnames().
        self.lstRel_member = []

    def merge( self ):
        # Turns a cluster of clusters (each of which must contain works)
        # into a cluster of works
        lstWork_ = []
        for clus_ in self.lstRel_member:
            assert clus_.__class__ == Cluster
            for work_ in clus_.getMembers():
                assert work_.__class__ == Work
                lstWork_.append( work_ )
        self.lstRel_member = lstWork_

class ClusterSet:

    def __init__( self, str_fileName=None, lstClus_persistent=None,
                  int_randomClusterCount=None ):
        # The constructor just loads or creates the clusters -- it doesn't
        # do any processing.
        # When constructing from a file, the clusters
        # have profiles for measuring relatedness, but have no members.
        # When constructing from a list of clusters, they keep their members.
        # Randomly generated clusters are given members.
        self.lstClus_ = []
        if str_fileName:
            self.readUserPlayStats( str_fileName )
        elif int_randomClusterCount:
            lstWork_ = Data.singleton.getWorks()
            int_countWorks = len( lstWork_ )
            lstInt_shuffled = shuffle( int_countWorks )
            if int_countWorks < int_randomClusterCount: # Obviously only applicable in small tests.
                int_randomClusterCount = int_countWorks
            int_numberOfRandomWorksPerCluster = int_countWorks / int_randomClusterCount
            clus_current = None
            for int_ in xrange( int_countWorks ):
                if int_ % int_numberOfRandomWorksPerCluster == 0:
                    if clus_current: # Skip first iteration
                        clus_current.computeClusterProfile( bool_binary=1 )
                    clus_current = Cluster()
                    self.addToClusterSet( clus_current )
                clus_current.addToCluster( lstWork_[ lstInt_shuffled[ int_ ] - 1 ] )
            clus_current.computeClusterProfile( bool_binary=1 ) # May end up doing this twice for a cluster
        else:
            assert lstClus_persistent
            self.lstClus_ = lstClus_persistent

    def consolidateArtists( self ):
        # Move all works for a given artist to the cluster with the greatest
        # concentration of works for that artist.
        # This may not be necessary in implementations where we can do all clustering at artist level.
        dictStrDictClusInt_artistClusterCount = {}
        dict_ = dictStrDictClusInt_artistClusterCount # short handle
        for clus_ in self.lstClus_:
            lstWork_ = clus_.getMembers()
            for work_ in lstWork_:
                str_artist = work_.getArtist()
                if dict_.has_key( str_artist ):
                    if dict_[ str_artist ].has_key( clus_ ):
                        dict_[ str_artist ][ clus_ ] = dict_[ str_artist ][ clus_ ] + 1
                    else:
                        dict_[ str_artist ][ clus_ ] = 1
                else:
                    dict_[ str_artist ] = { clus_ : 1 }

        dictStrClus_artistBestCluster = {}
        for str_artist in dict_.keys():
            clus_bestCluster = None
            int_bestCount = 0
            for tupClusInt_ in dict_[ str_artist ].items():
                if tupClusInt_[ 1 ] > int_bestCount:
                    int_bestCount = tupClusInt_[ 1 ]
                    clus_bestCluster = tupClusInt_[ 0 ]
            dictStrClus_artistBestCluster[ str_artist ] = clus_bestCluster

        for clus_ in self.lstClus_:
            clus_.makeEmpty()

        dictStrLstWork_artistWork = {}
        for work_ in Data.singleton.getWorks():
            str_artist = work_.getArtist()
            if dictStrLstWork_artistWork.has_key( str_artist ):
                dictStrLstWork_artistWork[ str_artist ].append( work_ )
            else:
                dictStrLstWork_artistWork[ str_artist ] = [ work_ ]

        for tupStrClus_ in dictStrClus_artistBestCluster.items():
            str_artist = tupStrClus_[ 0 ]
            clus_ = tupStrClus_[ 1 ]
            for work_ in dictStrLstWork_artistWork[ str_artist ]:
                clus_.addToCluster( work_ )

        for clus_ in self.lstClus_:
            clus_.computeClusterProfile()

    def getAverageSquaredUniqueArtist( self ):
        int_sum = 0
        for clus_ in self.lstClus_:
            int_count = clus_.getCountUniqueArtist()
            int_sum = int_sum + int_count ** 2.0
        return float( int_sum ) / len( self.lstClus_ )

    def getAverageCountUniqueArtist( self ):
        int_sum = 0
        for clus_ in self.lstClus_:
            int_count = clus_.getCountUniqueArtist()
            int_sum = int_sum + int_count
        return float( int_sum ) / len( self.lstClus_ )

    def getMaxCountUniqueArtist( self ):
        int_max = 0
        for clus_ in self.lstClus_:
            int_count = clus_.getCountUniqueArtist()
            if int_count > int_max:
                int_max = int_count
        return int_max

    def getMinCountUniqueArtist( self ):
        int_min = len( Data.singleton.getWorks() )
        for clus_ in self.lstClus_:
            int_count = clus_.getCountUniqueArtist()
            if int_count < int_min:
                int_min = int_count
        return int_min

    def getMaxClusterSize( self ):
        int_maxSize = 0
        for clus_ in self.lstClus_:
            int_size = len( clus_.getMembers() )
            if int_size > int_maxSize:
                int_maxSize = int_size
        return int_maxSize

    def getSignature( self ):
        # Returns a dictionary which is a signature of the cluster set.
        # Convenient since dicts can be tested for equality, don't need identity
        dictStrDictStrNone_longnameLongname = {}
        for clus_ in self.lstClus_:
            str_clusterName = clus_.getName()
            dictStrDictStrNone_longnameLongname[ str_clusterName ] = {}
            for str_associatedLongname in clus_.getAssociatedLongnames():
                dictStrDictStrNone_longnameLongname[ str_clusterName ][ str_associatedLongname ] = None
        return dictStrDictStrNone_longnameLongname

    def performClustering( self, lstRel_item, bool_recluster=0, bool_binary=1 ):
        # bool_recluster means recluster items that are already clustered.

        # For defensive programming, we copy the list object (nothing in the list is copied)
        # so that, when we add to the list below, it doesn't have side effects
        # for calling methods which expect the list to be unmodified
        lstRel_itemToCluster = copy.copy( lstRel_item )
        if bool_recluster:
            for clus_ in self.lstClus_:
                for rel_ in clus_.getMembers():
                    lstRel_itemToCluster.append( rel_ )
        for clus_ in self.lstClus_:
            clus_.makeEmpty() # Leaves profile intact
        for rel_ in lstRel_itemToCluster:
            float_bestRelatedness = 0.0 # default to no correlation
            clus_best = None
            for clus_ in self.lstClus_:
                float_currentRelatedness = clus_.getOverallRelatedness( rel_ )
                if float_currentRelatedness > float_bestRelatedness:
                    float_bestRelatedness = float_currentRelatedness
                    clus_best = clus_
            if float_bestRelatedness: # IF 0 DOES NOT GO INTO A CLUSTER!!
                clus_best.addToCluster( rel_ )
        # The enclosing loop line is not legible in the listing; recomputing every
        # cluster's profile is assumed here.
        for clus_ in self.lstClus_:
            clus_.computeClusterProfile( bool_binary ) # Prepare the cluster center for use in further correlation

    def convergeClusters( self, float_latestTime, int_minimumIterations, bool_binary=1 ):
        # float_latestTime is latest time to start an iteration
        float_currentTime = time.time()
        dict_oldSignature = None
        int_iterations = 0
        bool_done = 0
        while not bool_done:
            if int_iterations < int_minimumIterations or float_currentTime <= float_latestTime:
                print 'iterating:', int_iterations
                self.performClustering( [], bool_recluster=1, bool_binary=bool_binary )
                dict_newSignature = self.getSignature()
                if dict_newSignature == dict_oldSignature:
                    print 'finishing convergence due to unchanged signatures'
                    bool_done = 1
                else:
                    dict_oldSignature = dict_newSignature
                float_currentTime = time.time()
                int_iterations = int_iterations + 1
            else:
                print 'finishing due to timeout'
                bool_done = 1

    def merge( self ):
        for clus_ in self.lstClus_:
            clus_.merge()

    def getClusters( self ):
        return self.lstClus_

    def addToClusterSet( self, clus_ ):
        self.lstClus_.append( clus_ )

    def readUserPlayStats( self, str_fileName ):
        # We do not put members into the clusters, we only populate the profiles.
        self.lstClus_ = []
        fil_ = open( str_fileName, 'r' )
        str_ = fil_.read()
        fil_.close()

        class ClusterParser( xmllib.XMLParser ): # Embedded class, only used here!

            def __init__( self, clst_ ):
                self.clst_ = clst_
                self.clus_current = None
                xmllib.XMLParser.__init__( self )

            def start_member( self, dict_ ):
                str_artist = intern( dict_[ 'artist' ] )
                str_title = intern( dict_[ 'work' ] )
                tupStrStr_artistTitle = ( str_artist, str_title )
                str_longname = intern( '%s - %s' % tupStrStr_artistTitle )
                self.clus_current.addToProfile( str_longname )

            def start_cluster( self, dict_ ):
                self.clus_current = Cluster( dict_[ 'name' ] )
                # Both names must be reached through self; they are not visible as bare names here
                self.clst_.lstClus_.append( self.clus_current )

        parser_ = ClusterParser( self )
        parser_.feed( str_ )
        parser_.close()

    def writeToDisk( self, str_fileName ):
        fil_ = open( str_fileName, 'w' )
        fil_.write( '<?xml version="1.0" encoding="ISO-8859-1"?>\n' )
        fil_.write( """<ClusterContainer xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
xsi:noNamespaceSchemaLocation='ViewListContainer.xsd'>\n""" )
        fil_.write( '<clusters medium="music">\n' )
        for clus_ in self.lstClus_:
            fil_.write( '<cluster name="%s">\n' % clus_.getName() )
            lstTup_ = []
            for work_ in clus_.getMembers():
                float_relatedness = clus_.getOverallRelatedness( work_ )
                tup_ = ( float_relatedness, toXML( work_.str_artist ), toXML( work_.str_work ) )
                lstTup_.append( tup_ )
            lstTup_.sort()
            lstTup_.reverse()
            for tup_ in lstTup_:
                fil_.write( '<member artist="%s" work="%s" relatedness="%s"/>\n' %
                            ( tup_[ 1 ], tup_[ 2 ], tup_[ 0 ] ) )
            fil_.write( '</cluster>\n' )
        fil_.write( '</clusters>\n' )
        fil_.write( '</ClusterContainer>\n' )
        fil_.close()



##############################################################################################
# SCRIPT LOGIC

try:
    Config() # Get configuration data
    Data() # Create data singleton
    if Config.int_createClusterCount:
        # See http://www.math.tau.ac.il/~nin/learn98/idomil/
        int_numberOfClusters = int( Config.int_createClusterCount * math.log( Config.int_createClusterCount ) )
        float_maxTime = time.time() + Config.float_maxTime
        float_mostFabulous = float( len( Data.singleton.getWorks() ) * len( Data.singleton.getWorks() ) )
        while time.time() < float_maxTime:
            float_maxTime1 = ( float_maxTime - time.time() ) * .33 + time.time()
            float_maxTime2 = ( float_maxTime - time.time() ) * .66 + time.time()
            float_maxTime1 = ( float_maxTime - time.time() ) * .50 + time.time()
            float_maxTime2 = ( float_maxTime - time.time() ) * 1.0 + time.time()
            print 'In outer loop ######'
            print 'about to make cluster set'
            clst_1 = ClusterSet( int_randomClusterCount=int_numberOfClusters )
            print 'about to perform first clustering'
            clst_1.performClustering( [], 1 )
            print 'about to perform first convergence'
            clst_1.convergeClusters( float_maxTime1, Config.int_minimumConvergenceIterations )
            lstClus_1 = clst_1.getClusters()
            clst_2 = ClusterSet( int_randomClusterCount=Config.int_createClusterCount ) # A set of clusters of clusters
            print 'about to perform second clustering'
            clst_2.performClustering( lstClus_1, 0 ) # Make clusters of clusters
            print 'about to merge'
            clst_2.merge() # Change from clusters of clusters to clusters of works
            print 'about to perform second convergence'
            clst_2.convergeClusters( float_maxTime2, Config.int_minimumConvergenceIterations )
            clst_2.performClustering( [], 1, bool_binary=0 )
            print 'about to perform third convergence'
            clst_2.convergeClusters( float_maxTime1, Config.int_minimumConvergenceIterations, bool_binary=0 )
            float_fabulousness = clst_2.getAverageSquaredUniqueArtist()
            print 'max unique:', clst_2.getMaxCountUniqueArtist(), ' min unique:', clst_2.getMinCountUniqueArtist()
            print 'avg unique:', clst_2.getAverageCountUniqueArtist(), ' fabulousness:', float_fabulousness

            if float_fabulousness < float_mostFabulous:
                fil_ = open( 'tuninginfo.txt', 'w' )
                fil_.write( 'float_tuningRankBottom: ' + str( Config.float_tuningRankBottom ) + '\n' )
                fil_.write( 'float_tuningZeroWeight: ' + str( Config.float_tuningZeroWeight ) + '\n' )
                fil_.write( 'float_fabulousness: ' + str( float_fabulousness ) + '\n' )
                fil_.write( 'clst_2.getMaxCountUniqueArtist(): ' + str( clst_2.getMaxCountUniqueArtist() ) + '\n' )
                fil_.write( 'clst_2.getMinCountUniqueArtist(): ' + str( clst_2.getMinCountUniqueArtist() ) + '\n' )
                fil_.write( 'clst_2.getAverageCountUniqueArtist(): ' + str( clst_2.getAverageCountUniqueArtist() ) + '\n' )
                fil_.close()
                print '###FOUND NEW BEST###'
                print 'writing intermediate'
                float_mostFabulous = float_fabulousness
                clst_2.writeToDisk( 'intermediate.xml' )
                clst_best = clst_2

    elif Config.str_clusterFile:
        clst_cluster = ClusterSet( str_fileName=Config.str_clusterFile )
        clst_cluster.performClustering( Data.singleton.getWorks(), 0 )
        # The listing calls this on an undefined name (clst_actual); clst_cluster is assumed.
        clst_cluster.convergeClusters( Config.float_maxTime + time.time(),
                                       Config.int_minimumConvergenceIterations )
        clst_best = clst_cluster # Assumed, so that the final write below has a cluster set in this branch
    else:
        assert 0, 'Invalid config file option'

    clst_best.writeToDisk( Config.str_outClusterFile )
    print 'done!'
except Exception, str_:
    print 'ERROR'
    print str_

print '\n\nPress any key to abort:'
sys.stdin.read( 1 )
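
As a reading aid for the listing above, the following is a minimal Python 3 sketch of its three central calculations: counting how often two works are played by the same user, scaling those counts to unit ranks, and the zero-weighted average computed by Relatable.getOverallRelatedness. It is not part of the original program. The function names, the default zero_weight of 0.1 (the listing reads this value from Config.float_tuningZeroWeight), and the handling of ties (equal counts share the mean of their ranks) are assumptions made for illustration; the helper computeEvenRankUnitRanks used by the listing is not reproduced in this section and may behave differently.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_counts(user_works):
    """For each ordered pair of distinct works, count the users who played both."""
    counts = defaultdict(lambda: defaultdict(int))
    for works in user_works.values():
        for first, second in permutations(sorted(works), 2):
            counts[first][second] += 1
    return counts


def unit_ranks(count_by_name):
    """Scale ranks to the unit interval: lowest of n items is 0, highest (n-1)/n.

    Assumption (not from the listing): tied counts share the mean of their ranks.
    """
    ordered = sorted(count_by_name.items(), key=lambda item: (item[1], item[0]))
    n = len(ordered)
    ranks = {}
    start = 0
    while start < n:
        end = start
        # Extend the run while the next item has the same count.
        while end + 1 < n and ordered[end + 1][1] == ordered[start][1]:
            end += 1
        shared = (start + end) / 2.0 / n
        for name, _ in ordered[start:end + 1]:
            ranks[name] = shared
        start = end + 1
    return ranks


def overall_relatedness(profile, candidate, zero_weight=0.1):
    """Weighted mean of profile[name] * candidate[name] over the profile's names.

    A name the candidate lacks (or rates 0) adds nothing to the sum but still
    adds zero_weight to the divisor, mirroring getOverallRelatedness.
    zero_weight=0.1 is a hypothetical value.
    """
    total = 0.0
    divisor = 0.0
    for name, own_value in profile.items():
        other = candidate.get(name, 0.0)
        weight = 1.0 if other else zero_weight
        divisor += weight
        total += own_value * other * weight
    return total / divisor if divisor else 0.0
```

For example, `unit_ranks({'B': 3, 'C': 2})` gives C the rank 0.0 and B the rank 0.5, since B is the higher of two items. A smaller zero_weight makes profile entries that the candidate lacks count for less, so large cluster profiles penalize a candidate less for what it does not share.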
CLAIMS 

1. A system comprising clusters of works ordered so that the works in a given cluster are 
selected to be consistent with a particular set of human tastes, such system comprising: 

An input mechanism wherein data is collected for use in optimizing said clusters; 

A software mechanism for determining whether a particular possible change in the 
clustering would result in an improvement to the clustering; 

Input facilities for possible changes to be suggested; 

Facilities for implementing accepted changes; 

A display mechanism whereby users may observe the cluster membership. 

2. The method according to claim 1, wherein the software mechanism for determining whether 
a particular possible change in the clustering would result in an improvement to the 
clustering is based upon information transfer calculations as described in the theory of 
Shannon entropy. 

3. The method according to claim 1, wherein the input facilities for possible changes to be 
suggested comprises an HTML interface for humans to suggest changes, wherein said 
humans may be using multiple machines connected via the Internet. 

4. The method according to claim 1, wherein the input facilities for possible changes to be 
suggested accepts machine-generated suggestions. 

5. The method according to claim 4, wherein the input facilities for possible changes to 
be suggested accepts suggestions generated by remote machines connected via the Internet. 

ABSTRACT 

The invention involves clusters or hubs each comprising multiple works for which human beings 
might express taste-based preferences. 

The items are grouped in clusters in such a way that the works most in accordance with the tastes 
of any particular individual person will tend to be in a small number of these clusters out of the 
overall collection. In this way, clusters can be used to help the person find items that he is not 
previously familiar with but that he will probably like. 

The clustering of works is optimized by human effort, software, or both. By way of example, a 
methodology for doing this using the principle of information transfer as described in the theory 
of Shannon entropy is explained. When human effort is used to perform the optimization, 
facilities are provided for using such principles to determine whether a human-suggested change 
actually improves the clustering. 
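
One plausible reading of this criterion can be sketched in Python 3 as follows. The sketch measures the mutual information, in bits, between the user who played a work and the cluster the work belongs to, and treats a suggested move as an improvement when it raises that value. The choice of mutual information between user and cluster as the measure, the function names, and the data layout (a list of (user, work) play events and a work-to-cluster mapping) are all assumptions made for illustration; this section does not give the formula the invention actually uses.

```python
import math
from collections import Counter


def mutual_information(plays, cluster_of):
    """I(user; cluster) in bits, estimated from (user, work) play events."""
    joint = Counter((user, cluster_of[work]) for user, work in plays)
    total = sum(joint.values())
    user_totals = Counter()
    cluster_totals = Counter()
    for (user, cluster), count in joint.items():
        user_totals[user] += count
        cluster_totals[cluster] += count
    info = 0.0
    for (user, cluster), count in joint.items():
        p_joint = count / total
        p_user = user_totals[user] / total
        p_cluster = cluster_totals[cluster] / total
        info += p_joint * math.log2(p_joint / (p_user * p_cluster))
    return info


def change_improves(plays, cluster_of, work, new_cluster):
    """True if moving `work` into `new_cluster` raises I(user; cluster)."""
    proposed = dict(cluster_of)  # copy, so the current clustering is untouched
    proposed[work] = new_cluster
    return mutual_information(plays, proposed) > mutual_information(plays, cluster_of)
```

Under this reading, a clustering scores highest when each user's plays fall into few clusters, which matches the stated goal that the works suiting one person's tastes tend to lie in a small number of clusters.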

Facilities are provided whereby the optimization work may be distributed across multiple 
machines. 

Facilities are provided whereby artists may introduce new works to the system and quickly make 
them known to the people who are likely to enjoy them. Facilities are provided whereby users 
can easily receive recommendations for works they are likely to enjoy. 